WHAT IS ALREADY KNOWN ON THIS TOPIC
Artificial intelligence (AI)-assisted image interpretation algorithms can accurately detect pathological findings on retrospective image datasets and improve radiologist performance, but their impact on frontline clinicians is unclear.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
This study provides evidence that AI-assisted image interpretation may be used to improve the diagnostic performance of clinicians, especially those more junior, in detecting pathological findings such as pneumothorax (PTX).
Introduction
Artificial intelligence (AI) has transformative potential for healthcare. By far the most promising and mature use case for AI in clinical practice is AI-assisted image interpretation, which has accounted for a significant majority of AI-related academic publications and FDA certifications,1 2 and remains a rapidly developing area of innovation.3 4 Much AI imaging research has focused on the development, validation and evaluation of algorithms as measured against the diagnostic performance of senior radiologist readers.5–9 However, in the acute healthcare setting, many diagnostic images are typically reviewed and acted on by non-radiologists with varying degrees of clinical experience and expertise.10 The potential impact of AI-assisted image interpretation on the diagnostic accuracy of clinicians who interpret images and deliver care based on their findings in routine clinical practice therefore remains an important research question, and studies have begun to demonstrate potential benefits in an emergency medicine context.11
Recent guidance from NICE12 and AI-specific reporting guidelines have emphasised the importance of conducting evaluations in the clinical context in which the technology is likely to be used, including feedback on usability and confidence directly from the intended users.13–17
Aims
To measure the diagnostic accuracy of the pneumothorax (PTX) detection facility of GE Healthcare's (GEHC) Critical Care Suite (CCS) application against an independent reference standard, and to assess its impact on the reporting performance of clinician groups routinely involved in the diagnosis and management of PTXs.
Methods
Design
A multicentre, multi-case multi-reader cohort study was conducted between 6 October 2021 and 27 January 2022 (figure 1).
Figure 1 PTX reader study activity flowchart. PTX, pneumothorax.
The Critical Care Suite, an FDA-approved and CE-marked suite of AI-based applications from GEHC, includes an algorithm for detecting and localising PTX on frontal chest X-rays (CXRs). The output is presented to the clinician as a 'heatmap' localising the area of the CXR image with a detected pneumothorax, together with a corresponding confidence score. A variable cut-off threshold for the confidence score determines the level at which the algorithm labels an image as positive for the presence of PTX (online supplemental figure 1).18
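For illustration only, this thresholding step can be sketched as follows; the function name, default threshold and example values below are hypothetical and do not reflect CCS internals:

```python
# Hypothetical sketch: converting a continuous PTX confidence score into a
# binary positive/negative label via a variable cut-off threshold.
def classify_ptx(confidence: float, threshold: float = 0.5) -> bool:
    """Label an image PTX-positive if its confidence score meets the cut-off."""
    return confidence >= threshold

# Lowering the threshold trades specificity for sensitivity; sweeping it is
# what traces the ROC curve described in the statistical analysis below.
for t in (0.3, 0.5, 0.7):
    print(f"threshold {t}: positive = {classify_ptx(0.62, threshold=t)}")
```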
We evaluated the impact of this AI-assisted image interpretation algorithm on the diagnostic performance and confidence of a range of clinicians involved in identifying and acting on PTXs found on plain chest radiography in routine clinical practice. Pending awaited AI-specific guidelines, this manuscript follows the STARD reporting guidelines for studies evaluating the performance of diagnostic tests in clinical practice.19 20
Image dataset
A total of 400 retrospectively collected and de-identified CXR images of patients aged 18 years or older were identified by a research radiographer searching the radiology database of Oxford University Hospitals and subsequently curated into the project via the National Consortium of Intelligent Medical Imaging (NCIMI) databank (online supplemental table 1). 200 images positive for PTX were selected via radiology reports, with 50 of each of the following subtypes identified from note review and/or radiology request:
Primary PTX (no underlying lung disease).
Secondary PTX (underlying lung disease, eg, COPD).
Iatrogenic PTX (eg, cardiothoracic surgery).
Traumatic PTX.
Difficulty scoring
All images with positive PTXs were assigned a 'Difficulty Score' by a consultant radiologist and a senior radiographer (online supplemental table 2), considering four contributing factors, specifically:
Size of PTX (large is considered >0.5 cm at any point).
Patient-specific factors (eg, kyphosis/obesity).
Image quality, specifically exposure factors and image penetration.
Presence of foreign bodies, artefacts and other pathological findings on the image.
Images without PTXs were also allocated a ‘Difficulty Score’, excluding the ‘PTX Size’ parameter.
Ground truth
Readings by senior consultant thoracic radiologists from Royal Cornwall, Greater Glasgow and Clyde, and Oxford University Hospitals determined the ground truth (gold standard equivalent). All CXR images were first independently reviewed by two radiologists for the presence/absence of PTX using a web-based DICOM viewer (www.raiqc.com). Images were annotated for the presence or absence of a PTX and, if deemed present, a region of interest (ROI) was applied covering the entire PTX. In cases of discordance between the two radiologists' interpretations, arbitration was undertaken by the third radiologist.
Participants
To evaluate the impact of the algorithm on human diagnostic performance, an observer performance test was performed by a reader panel comprising 18 clinicians at three levels of seniority: consultant/senior (>7 years' experience), middle grade/registrar (4–7 years) and junior (<4 years). Readers were drawn equally from six clinical specialties (Emergency Medicine, Acute General/Internal Medicine, Respiratory Medicine, Intensive Care, Cardiothoracic Surgery and Radiography) and were working across four NCIMI hospital sites, recruited via the Thames Valley Emergency Medicine Research Network (www.TaVERNresearch.org).
Reader phases
The study included two reader phases. After enrolment, readers were given access to a short online training module with a series of five test cases to familiarise them with the study and the Report and Image Quality Control (RAIQC) platform. They were then given access to four dedicated modules on the RAIQC platform (first reader phase) and asked to interpret the entire dataset over a period of 3 weeks, recording the perceived presence/absence of a PTX on each image and their degree of confidence on a four-point Likert scale. Readers were asked to localise any PTX identified by clicking on the ROI; the identification of a PTX was only marked as 'correct' if they clicked within the area contoured during the ground truth process. Readers were given access to the clinical indication as entered on the original X-ray request form for each image but were blinded to the ground truth for each image, to the overall performance of the algorithm against ground truth and to the overall prevalence of PTX in the dataset (online supplemental figure 2).
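This localisation rule amounts to a point-in-region test. The sketch below is an illustrative simplification (the function, mask representation and coordinates are hypothetical, not the RAIQC implementation):

```python
# Hypothetical sketch of the localisation rule: a PTX call counts as correct
# only if the reader's click falls inside the ground-truth ROI, here
# simplified to a binary mask over the image grid.
import numpy as np

def click_is_correct(click_xy: tuple[int, int], roi_mask: np.ndarray) -> bool:
    """Return True if the reader's click lands within the annotated ROI."""
    x, y = click_xy
    h, w = roi_mask.shape
    return 0 <= y < h and 0 <= x < w and bool(roi_mask[y, x])

# Example: a 100x100 image with a rectangular ROI standing in for the PTX.
mask = np.zeros((100, 100), dtype=bool)
mask[10:40, 60:90] = True                # hypothetical annotated PTX region
print(click_is_correct((75, 20), mask))  # inside the ROI -> True
print(click_is_correct((5, 5), mask))    # outside the ROI -> False
```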
The first phase was followed by a minimum 5-week 'washout' period, after which all readers re-interpreted the images with access to an additional version of each image depicting the AI output (online supplemental figure 3). Qualitative surveys on the perceived utility and value of AI-assisted image interpretation were completed by all participants before and after each reader phase.
Readings for each phase were undertaken remotely online on a laptop or PC in a location of the participants’ choosing, in single or multiple sessions as preferred. The dataset was divided into four equally sized modules, and the sequence of the CXRs was randomised within each module for each individual reader during both phases.
Automated time measurements for the completion of each case were undertaken through a dedicated function in the RAIQC platform.
Statistical analysis
The primary outcome was the difference in diagnostic characteristics (sensitivity, specificity, accuracy) of readers without and with AI augmentation. Secondary outcomes included comparative analysis of performance by subgroup: medical profession, level of seniority and image difficulty score, including type, location and size of PTX, as well as time taken to complete the reads. Stand-alone algorithm performance at varying sensitivity thresholds for labelling an image as positive for PTX was assessed by calculating the area under the receiver operating characteristic (ROC) curve, with ROC and free-response ROC (FROC) curves plotted with their variance. To account for correlated errors arising from readers interpreting the same cases with and without AI, the Obuchowski-Rockette/Dorfman-Berbaum-Metz (OR-DBM) procedure, a modality-by-reader random-effects analysis of variance model, was used for estimation.21–23
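As an illustrative sketch of the stand-alone ROC analysis (the study's analyses were run in R; the Python arrays and values below are hypothetical):

```python
# Hypothetical sketch: sweeping the confidence-score threshold to trace the
# ROC curve and summarise stand-alone performance as an AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])                  # ground truth
scores = np.array([0.9, 0.4, 0.2, 0.8, 0.1, 0.3, 0.7, 0.6])  # AI confidence

auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(f"AUROC = {auc:.2f}")
# Each threshold is one operating point on the sensitivity/specificity trade-off.
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:.2f}: sensitivity {t:.2f}, specificity {1 - f:.2f}")
```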
Sample size
We used the tool 'Multi-Reader Sample Size Program for Diagnostic Studies'24 to estimate power for the number of readers and cases in our study (https://perception.lab.uiowa.edu/power-sample-size-estimation). For 18 readers, reading 400 cases yields 80% power to detect a difference in accuracy of 10% with a type I error rate of 5%.24
All statistical analyses were performed using R software (V.4.0.2; R Foundation for Statistical Computing). The significance threshold was set at two-sided 5% (p=0.05) for all secondary analyses.
Role of the funding sources
This study was funded by Innovate UK and by GE Healthcare. Research activity, including data analysis and the decision to publish, was conducted independently of the funders, with the exception that AI inferencing of the CXR images was undertaken by GEHC.
Results
Dataset characteristics
Of the 200 cases positive for pneumothorax, five cases failed the pre-inferencing quality check and were removed from the image dataset, leaving a total of 395 radiographs. Six PTX-positive images were reclassified as negative following the radiologist review and labelling (ground truth), leaving 189 positive for PTX and 206 negative.
Algorithm performance
Of the 189 positive cases, 138 were correctly identified by the AI algorithm as positive for pneumothorax, leaving 51 false negatives; 198 of the 206 negative cases were classified correctly (online supplemental figure 4). Analysis of the performance of the algorithm alone at default settings compared with the ground truth (thoracic radiologists) demonstrated a sensitivity of 0.73 (95% CI 0.66, 0.80) and a specificity of 0.96 (95% CI 0.92, 0.98), with a positive predictive value of 0.95 (95% CI 0.89, 0.98) and a negative predictive value of 0.79 (95% CI 0.74, 0.84), and an overall AUROC of 0.94 based on a variable sensitivity threshold (table 1) (online supplemental table 3, online supplemental figure 5).
Table 1 Diagnostic performance of the algorithm vs ground truth
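These point estimates follow directly from the underlying confusion matrix (TP=138, FN=51, TN=198, FP=8). A minimal worked sketch, assuming Wilson score intervals (the exact CI method used in the paper is not restated here):

```python
# Worked sketch of the stand-alone diagnostic metrics from the confusion
# matrix reported above, with Wilson 95% CIs via statsmodels.
from statsmodels.stats.proportion import proportion_confint

TP, FN, TN, FP = 138, 51, 198, 8

def report(successes: int, total: int, name: str) -> None:
    lo, hi = proportion_confint(successes, total, alpha=0.05, method="wilson")
    print(f"{name}: {successes / total:.2f} (95% CI {lo:.2f}, {hi:.2f})")

report(TP, TP + FN, "Sensitivity")  # ~0.73
report(TN, TN + FP, "Specificity")  # ~0.96
report(TP, TP + FP, "PPV")          # ~0.95
report(TN, TN + FN, "NPV")          # ~0.79
```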
Reader performance
All participants interpreted all images with and without AI, generating a total of 14 220 image reads. Pooled analysis demonstrated an overall sensitivity of 66.8% (95% CI 57.3, 76.2) without AI, increasing to 78.1% (95% CI 72.2, 84.0; p=0.002) with AI, and a non-statistically significant increase in specificity from 93.9% (95% CI 90.9, 97.0) without AI to 95.8% (95% CI 93.7, 97.9; p=0.247) with AI. Overall diagnostic performance characteristics for readers compared with ground truth with and without AI are summarised in table 2 and figure 2. Intraclass correlation (online supplemental table 4) improved significantly from 0.575 (95% CI 0.536, 0.615) unaided to 0.802 (95% CI 0.778, 0.825) aided. There was a marked and consistent improvement in pooled reader accuracy between phases 1 and 2 (online supplemental figure 6). Accuracy also increased throughout the aided phase, with decreased variation and an overall upward trend across modules, reflected in the improved sensitivity scores in modules 3 and 4 (online supplemental table 5).
Figure 2 Overall unaided and aided reader diagnostic performance. AI, artificial intelligence.
Table 2 Overall sensitivity and specificity for readers compared with ground truth without and with AI
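As an illustrative sketch of an intraclass correlation computation for inter-reader agreement (the column names and values below are hypothetical; in practice this would run over all 18 readers' per-image calls):

```python
# Hypothetical sketch: intraclass correlation across readers' per-image PTX
# calls, using long-format data and the pingouin package.
import pandas as pd
import pingouin as pg

reads = pd.DataFrame({
    "image":  ["i1", "i1", "i2", "i2", "i3", "i3",
               "i4", "i4", "i5", "i5", "i6", "i6"],
    "reader": ["r1", "r2"] * 6,
    "call":   [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0],  # 1 = PTX reported
})

icc = pg.intraclass_corr(data=reads, targets="image", raters="reader",
                         ratings="call")
print(icc[["Type", "ICC", "CI95%"]])
```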
Individual reader performance
A summary of individual reader sensitivity with and without AI is presented in online supplemental table 6. 11 out of 18 readers showed a statistically significant improvement in sensitivity with AI, with a further 3 readers showing a small, non-statistically significant increase. Three senior readers (#12, 16 and 17) showed a non-statistically significant decrease in sensitivity, and one senior reader (#5) showed a small but statistically significant decrease in sensitivity with AI.
One junior reader (#9) did not adhere correctly to the technical specifications of the study, using hardware incompatible with the RAIQC platform. Due to this protocol deviation, they were asked to repeat both reading phases and the subsequent results were included in the final analysis. Post hoc sensitivity analysis demonstrated no change in the overall reader performance as a result of this update. The highest improvements in sensitivity were seen in those readers with the lowest unaided sensitivity (figure 3).
Figure 3 Improvement in sensitivity of individual readers conferred by AI assistance vs initial unaided sensitivity. AI, artificial intelligence.
Subgroup analyses
Across specialty groups, the highest performance and smallest increase in sensitivity were seen in the cardiothoracic surgery subgroup, with intensive care (ITU) clinicians showing the lowest performance and largest improvement; however, no specialty subgroup difference reached statistical significance (online supplemental table 7). A very large, statistically significant increase in sensitivity (21.7%, 95% CI 10.9 to 32.6, p<0.01) was seen in the aided junior reader subgroup, compared with that seen in the middle grade (6.3%, 95% CI −4.8 to 17.5, p=0.21) and senior (6.0%, 95% CI −6.0 to 18.0, p=0.26) groups (figure 4). Aided junior grade readers increased in sensitivity to a level comparable with that of aided middle and senior grades, and higher than that of unaided senior grades, with a significant increase in intraclass correlation in all three groups (online supplemental table 8). Statistically significant increases in sensitivity with AI were seen across all types of PTX and all PTX locations except medial, which showed a non-statistically significant increase (5.1%, 95% CI −3.7 to 13.8, p=0.25). Statistically significant increases in sensitivity were shown for X-rays of all levels of difficulty, though increases were non-statistically significant in the higher obesity and smaller PTX subcategories.
Figure 4 Reader sensitivity by seniority/grade. AI, artificial intelligence.
Reader reporting time
The effect of the use of the CCS PTX AI tool on reader analysis and reporting time is shown in online supplemental figure 7. Without the AI tool, the mean reporting time across all 18 readers was 30.2 s per image (95% CI 24.4 to 37.4 s), while with the AI tool, reporting took a mean of 26.9 s per image (95% CI 21.8 to 33.4 s; p>0.05 for the effect of the AI tool). Extreme outlying results presumed to be due to technical failure (eg, leaving cases open in the browser between read sessions) were winsorised to prevent skewing of the data.
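Winsorisation clamps extreme values to percentile bounds rather than discarding them. A minimal sketch (the percentile limits and timing values are illustrative, not the study's chosen cut-offs):

```python
# Illustrative sketch: winsorising extreme per-case reporting times so that a
# case accidentally left open in the browser does not skew the mean.
import numpy as np

def winsorize(values: np.ndarray,
              lower_pct: float = 5, upper_pct: float = 95) -> np.ndarray:
    """Clamp values outside the given percentiles to the percentile bounds."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

rng = np.random.default_rng(0)
times_s = np.append(rng.normal(30, 5, size=20), 3600.0)  # one case left open

print(f"raw mean: {times_s.mean():.1f} s")               # dominated by the outlier
print(f"winsorised mean: {winsorize(times_s).mean():.1f} s")  # plausible per-image time
```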
Self-reported confidence
Reader confidence in correctly interpreted images increased in the aided phase compared with the unaided phase: the proportion of 'certain' and 'high' confidence ratings among correct interpretations (ie, true positives/true negatives) rose with AI assistance (online supplemental figure 8).
Discussion
This study evaluated the impact of an AI-assisted image interpretation algorithm on the diagnostic performance of a range of clinicians routinely involved in identifying PTX. Evaluation of the algorithm's performance against a 'ground truth' reference standard of senior radiologist reporting showed high specificity (96%) and moderate sensitivity (73%), with an AUROC of 0.94. Overall group sensitivity improved by 11.4%, from 66.8% to 78.1% (p=0.002), with specificity showing a non-significant improvement. Overall accuracy improved significantly between unaided and aided readers, with increased intraclass correlation and a growing magnitude of accuracy improvement over the course of the aided reader phase. Clinicians demonstrated no overall increase in reporting time associated with the use of AI. Overall self-reported confidence in an accurate diagnosis improved, while confidence in an inaccurate diagnosis decreased. Subgroup analyses demonstrated increases in sensitivity across all reader and image subgroups, with a marked increase in the junior reader group (21.7%), whose sensitivity improved to a level (77.7%) comparable with that of the aided middle grade (77.2%) and senior (79.5%) groups and greater than that of the unaided senior group (73.5%).
There are few existing studies that directly evaluate the impact of AI on the diagnostic performance of clinicians as opposed to radiologists, and none that evaluate the impact of AI-assisted PTX detection on frontline UK clinicians of varying seniority and specialty.5 9 25 A recent study evaluated the impact of an algorithm detecting four abnormalities on the reporting performance of radiologists of varying experience. The algorithm in that study showed 100% sensitivity for pneumothorax, albeit on a much smaller dataset (80 images). Readers showed a higher baseline sensitivity than in our study, as might be expected given their skillset, though with a similarly large improvement in aided performance.5 It is unclear whether the dataset in that study was comparable in complexity to the dataset used in this study. Other studies evaluating the impact of AI on fracture detection on plain radiographs and CT by clinicians have found results comparable to ours in terms of sensitivity and overall improvement.7 11 26
The large, statistically significant improvement seen here appears to have been driven by marked increases in the diagnostic performance of the less proficient and less experienced clinicians in the group. The AI tool increased sensitivity most for readers with lower unassisted scores. Indeed, certain high-performing individuals whose unaided sensitivity exceeded that of the algorithm alone did not show a similar increase in accuracy, with some even performing worse in the aided reader phase. This suggests both an important potential use case for AI-enhanced imaging in supporting the performance of less skilled clinicians and the need for caution and clinician education during implementation. Overall sensitivity, and all individual reader sensitivities with one exception, were higher than the performance of the algorithm alone (73%), suggesting that readers overall incorporated AI findings into their diagnostic reasoning rather than blindly following the algorithm; this addresses an important concern regarding the introduction of AI into clinical practice.27 28 The increase in confidence scores for correct interpretations may indicate that practitioners are likely to act appropriately on their findings when supported by AI, though this would need to be fully evaluated in a prospective study.
The introduction of new AI-based diagnostic technologies that can detect abnormalities in medical imaging may raise concerns about clinicians/radiologists gradually losing interpreting skills or relying solely on the findings of the algorithm.27 29 Our results, however, indicate a potential learning effect associated with access to AI-enhanced imaging, which may have significant implications for use cases and implementation strategies for similar technologies. Productivity was unaffected, with no increase in the time taken to report each image; other studies have demonstrated similar findings.5 7
Strengths
In contrast to many other reader studies, this prospective study was based on a large combined dataset and reader group, yielding a large number of individual readings for analysis. Comparisons were based on a ground truth of three independent senior radiologist reports, the current gold standard for similar studies.7 The study used a challenging dataset, characterised in detail and reflective of real-world complexity, and readings were undertaken by a broad range of clinicians and radiographers of varying seniority from multiple hospital sites.
Limitations
This study used an artificially enriched dataset with a high prevalence of PTX, an approach commensurate with other similar studies.6 30 31 It was based on the detection of a single pathology, which limits the degree to which the findings may be generalised to the interpretation of chest radiographs in the acute setting, though other pathologies were included to distract and confound the readers and algorithm, in order to reflect 'real-world' practice. Each seniority level had only one representative from each specialty, but the clinician pool was large, allowing pooled analysis results to be more generalisable than in prior similar studies. It is also feasible that some of the improvement in performance during the study was due to the experience of reporting large numbers of chest radiographs; however, there was a clear separation between the phases in terms of performance, as well as an apparent increase in performance throughout the second phase not seen in the first. The participants undertook the reader phases in artificial conditions, free from the distractions of a busy clinical shopfloor, and were informed that they were specifically looking for the presence of PTX; this may have resulted in their unaided accuracy being higher than would be encountered in real-world practice, potentially reducing the apparent effect of the intervention. Readers undertook the study using personal or institutional laptops and workstations rather than being limited to PACS-quality devices, which may have reduced their accuracy. Many small PTXs may be successfully managed conservatively, so it is possible that detecting them may not translate directly into meaningful clinical impact; equally, a false positive finding of a PTX may lead to unnecessary intervention. However, improved identification of PTX through the reduction of false negative results is likely to reduce re-presentation or the need for recall, along with the concomitant clinical governance burden, and should ultimately improve patient and clinician experience and potentially reduce healthcare presentations. These aspects should be evaluated in future prospective clinical and health economic studies.
Implications
The findings suggest that AI-assisted image interpretation may significantly improve diagnostic performance and confidence in identifying PTX on chest radiographs, especially for junior or inexperienced clinicians with lower unassisted performance. However, it may adversely affect highly skilled senior clinicians, underlining the need for appropriate user training before implementation that covers the operational strengths and weaknesses of the algorithm. These findings may generalise to other pathologies and imaging modalities if supported by comparably efficacious algorithms. The study also raised the possibility of a learning effect from repeated AI-assisted interpretation, which warrants further exploration.
Conclusions
This study demonstrates that AI-assisted image interpretation can potentially improve clinicians' performance in detecting pathologies such as PTX, with the most marked improvement seen in less proficient individuals, and without increasing interpretation time. It provides evidence that AI may support junior practitioners in non-specialist acute settings by improving diagnostic performance. However, definitive evidence and the magnitude of potential clinical and health economic benefits require further study.
Data availability statement
Data are available upon reasonable request. All datasets and documents related to this study currently reside securely at Oxford University Hospitals NHS Foundation Trust and will be made available upon reasonable request to the corresponding author. The AI algorithm used in this research is a commercially available third-party product and as such the authors do not have sharing rights; enquiries can be made via https://www.gehealthcare.com.
Ethics statements
Patient consent for publication
Not applicable.
Ethics approval
This study involves human participants, but as per Health Research Authority guidance (https://www.hra-decisiontools.org.uk/research/), it constitutes observational research. Clinical data were used under an existing ethical governance framework between the National Consortium of Intelligent Medical Imaging and Oxford University Hospitals NHS Foundation Trust. The collection and aggregation of the retrospective dataset was conducted under a data protection impact assessment approved by Oxford University Hospitals NHS Foundation Trust. Participants gave informed consent to participate in the study before taking part.