Can incorrect artificial intelligence (AI) results impact radiologists, and if so, what can we do about it? A multi-reader pilot study of lung cancer detection with chest radiography

Design

A fully crossed, complete, and balanced block design was used in which all radiologists read the same 90 chest radiographs from 90 patients under four conditions (No AI, AI Keep (No Box), AI Delete (No Box), AI Keep (Box)) (Fig. 1). The order of the AI Keep (No Box) and AI Delete (No Box) conditions was randomly counterbalanced across radiologists to reduce order effects. No AI was always first and AI Keep (Box) was always last to avoid contamination (i.e., the first condition was standard of care and provided no feedback that could contaminate later conditions, and the last condition pinpointed suspected pathology). The AI results were manipulated to provide 8 false positives (FPs) and 4 false negatives (FNs); the AI was correct for the remaining 78 cases (ROC-AUC of 0.87). Cases were presented in random order. Radiologists were blinded to patient-identifying information, and reading sessions were separated by at least 1 month (the average inter-session duration for included radiologists was 113.8 days).
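
As a sanity check, the stated ROC-AUC follows from the manipulated error counts once the 50% prevalence noted under Case selection (45 positive, 45 negative) is assumed: a single-threshold output traces one operating point, so the trapezoidal AUC reduces to balanced accuracy. The short Python sketch below (variable names are ours, not from the study) illustrates the arithmetic.

```python
# Quick check that the manipulated error counts imply the stated ROC-AUC,
# assuming 45 positive and 45 negative cases (50% prevalence) and treating
# the single-threshold AI output as one operating point on the ROC curve.
n_pos, n_neg = 45, 45      # assumed split of the 90 cases
fn, fp = 4, 8              # manipulated AI errors described above

sensitivity = (n_pos - fn) / n_pos   # 41/45 ~ 0.911
specificity = (n_neg - fp) / n_neg   # 37/45 ~ 0.822

# For a single operating point, the trapezoidal AUC reduces to
# balanced accuracy: (sensitivity + specificity) / 2.
auc = (sensitivity + specificity) / 2
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, AUC={auc:.2f}")
# -> sensitivity=0.911, specificity=0.822, AUC=0.87
```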

Fig. 1

Study flow chart. The overall experimental procedure is displayed. The order of AI Keep (No Box) and AI Delete (No Box) was counterbalanced such that n = 3 participated in AI Keep (No Box) first and n = 2 participated in AI Delete (No Box) first

Cover story

Radiologists were told that the goal of the study was to evaluate several different AI programs that had been developed for chest radiography and that they would therefore interpret radiographs across several sessions. Radiologists were instructed to indicate whether follow-up imaging was needed for a possible pulmonary lesion. Unbeknownst to the radiologists, the only pathology present was lung cancer. For one session, radiologists were told they would not have AI (No AI) so that they could serve as a control group. For each of the other sessions, radiologists were told they would be interpreting cases with a unique AI program.

AI feedback

The AI result for each image was provided in two ways. First, either “Abnormal” or “Normal” was displayed in the top right-hand corner of the image. Second, the experimenter read the AI result aloud to reinforce the Keep vs. Delete manipulation (Supplement 1). For the AI Keep (Box) condition, a box was also placed on the image when the AI result was “Abnormal”; no such box was used in the other conditions. Because the effect of how AI feedback was implemented was of interest, the AI feedback itself needed to be controlled; thus, no real AI system was used. The AI feedback for each image was chosen by M.K.A. This enabled us to control the number of true positives (TPs), true negatives (TNs), FPs, and FNs, and which cases fell into each of these categories. The AI result was identical across experimental conditions; only the instructions regarding keep vs. delete vs. box differed (Supplement 1).
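
In the study, the specific FP and FN cases were hand-selected by M.K.A. Purely for illustration, the hedged Python sketch below shows one way a comparable feedback set with the same fixed error counts could be generated; the random draw and all names are our own, not part of the study.

```python
import random

def simulate_ai_feedback(ground_truth, n_fp=8, n_fn=4, seed=0):
    """Illustrative construction of fixed-error AI feedback.

    ground_truth: dict mapping case_id -> True (cancer) / False (no cancer).
    Returns a dict mapping case_id -> "Abnormal" / "Normal".
    In the study the error cases were hand-selected rather than drawn at
    random; the random draw here is only for illustration.
    """
    rng = random.Random(seed)
    positives = [c for c, pos in ground_truth.items() if pos]
    negatives = [c for c, pos in ground_truth.items() if not pos]

    fn_cases = set(rng.sample(positives, n_fn))  # positives labeled "Normal"
    fp_cases = set(rng.sample(negatives, n_fp))  # negatives labeled "Abnormal"

    feedback = {}
    for case, is_positive in ground_truth.items():
        abnormal = (is_positive and case not in fn_cases) or (case in fp_cases)
        feedback[case] = "Abnormal" if abnormal else "Normal"
    return feedback
```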

Radiologists were told the ROC-AUC value of each system was higher than 0.85, which approximates clinical practice [3] and was true for the simulated AI systems.

Case selection

A total of 90 frontal chest radiographs (CXRs) (50% positive for lung cancer) from patients imaged between 2013 and 2016 were used. Positive CXRs were obtained within 2 weeks of a CT demonstrating a single dominant lung nodule/mass confirmed at biopsy to be lung cancer, with no other airspace opacities, consolidations, or macronodules detectable on CXR; negative CXRs were obtained within 2 weeks of a CT demonstrating no nodule/mass. More information is provided in Supplement 2.

Procedure

Due to COVID-19, the experimenter and radiologist communicated over videoconferencing. A convenience sample of six radiologists (E.H.D., A.W.P.M., A.R.K., S.A., R.C.W., T.T.H.), with 3, 1, 14, 6, 5, and 12 years of experience, respectively, consented and participated. At the appointed time, the radiologist and experimenter (M.H.B.) joined a videoconference. Radiologists were reminded of the ostensible goal of the experiment (evaluating several different AI programs) and were then told that the ROC-AUC for the company’s AI program in the present session was higher than 0.85 (sessions 2–4). To encourage effortful participation, radiologists were told that their performance relative to the other radiologists would be evaluated by M.H.B. and M.K.A. (Vice Chair of Research and thoracic radiologist) at the conclusion of the study (although no such evaluation actually took place), and they viewed a sample evaluation sheet. Radiologists were told that, in order to fully evaluate each AI system, all images were taken from 55–75-year-old smokers who were outpatients and that the dataset was therefore enriched. In actuality, as noted above, positive and negative cases were present in equal numbers, although radiologists were not told the prevalence rate to avoid confounding their performance [18]. No history was provided for any of the images. The radiologist then interpreted each anonymized image one by one using an Orthanc-based research PACS [19]. All image interpretation was conducted in a clinical radiology reading room using standard PACS monitors; except for the manner in which cases were loaded from Orthanc, viewing conditions were identical to those of clinical practice. The experimenter was blinded to ground truth. The procedure is shown in Fig. 2.

Fig. 2

Procedural overview. A brief description of the major procedural elements within each session is shown

Interpretation and outcomes

Radiologists answered two questions for each image: (1) whether they recommended follow-up CT imaging (yes/no) and (2) their level of confidence (1 = not at all confident to 5 = very confident). All responses were stated aloud and transcribed by the experimenter. The follow-up imaging recommendation, as a clinical decision, was used as the primary outcome. Thus, failing to follow up a positive case was an FN, not following up a negative case was a TN, following up a positive case was a TP, and following up a negative case was an FP.
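
For illustration, the outcome definitions above can be expressed as a small decision rule; the Python sketch below is a minimal rendering with hypothetical names, not code used in the study.

```python
def classify_decision(followed_up: bool, has_cancer: bool) -> str:
    """Map a follow-up recommendation against ground truth, per the definitions above."""
    if has_cancer:
        return "TP" if followed_up else "FN"
    return "FP" if followed_up else "TN"

# Example: not recommending follow-up for a cancer case counts as a false negative.
assert classify_decision(followed_up=False, has_cancer=True) == "FN"
```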

Debriefing and exclusion

Following the conclusion of the study, radiologists were debriefed and all deception was explained. One radiologist was excluded, so analyses were performed on the remaining five radiologists. The excluded radiologist did not fully believe the cover story and correctly suspected that the experimenter was interested in his/her behavior; he/she also reported being unaware of the keep versus delete manipulation. All procedures were approved by the Rhode Island Hospital IRB.

Statistical analysis

All analyses were conducted using SAS software version 9.4 (SAS Institute Inc.). To test the hypotheses, FPs and FNs were compared between conditions using generalized linear mixed models (GLMMs) assuming a binary distribution; a GLMM was also used to examine diagnostic confidence between conditions. Alpha was set a priori at 0.05, and all interval estimates were calculated at the 95% confidence level. Analyses were conducted by the statistical author (G.L.B.).
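
The paper does not specify the SAS procedure used. As a rough, non-authoritative analogue, the sketch below fits a population-averaged binomial model (a GEE clustered by reader) in Python rather than the subject-specific GLMM fit in SAS; the data frame columns and file name are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumed layout: one row per reader x case, with columns
#   reader (id), condition (No AI / Keep / Delete / Keep-Box),
#   fp (0/1 false-positive indicator on negative cases).
df = pd.read_csv("reads.csv")  # hypothetical file name

# Binomial model of FP rate by condition with reader as the clustering unit.
# GEE is a population-averaged stand-in, not the GLMM the authors fit in SAS.
model = smf.gee(
    "fp ~ C(condition, Treatment(reference='No AI'))",
    groups="reader",
    data=df,
    family=sm.families.Binomial(),
)
result = model.fit()
print(result.summary())
```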

Manipulations

The AI result manipulations (Keep vs. Delete; Box vs. No Box) were introduced at three points during the experiment: (1) during the preliminary instructions at the beginning of the session, (2) when the experimenter read the AI results aloud, and (3) on the sample evaluation sheet. These are detailed in Supplement 1. Briefly, in AI Delete (No Box), radiologists were told the AI results would not be saved, whereas in AI Keep (No Box) and AI Keep (Box) they were told the AI results would be saved. AI Keep (Box), unlike the other two AI conditions, also provided a box around suspicious regions when the AI result was “Abnormal.”
