Tomography, Vol. 8, Pages 2815-2827: Diagnostic Performance in Differentiating COVID-19 from Other Viral Pneumonias on CT Imaging: Multi-Reader Analysis Compared with an Artificial Intelligence-Based Model

1. Introduction

Coronavirus Disease 2019 (COVID-19) is a complex infectious disease caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which has caused more than half a billion cases and 6 million deaths since it was first reported in late 2019 [1].

From a radiological point of view, CT findings of SARS-CoV-2 pulmonary infection include ground-glass opacities, areas of crazy-paving pattern, and consolidations. Such alterations are usually multiple and bilateral, with patchy distribution and predominant involvement of basal and subpleural lung regions [2,3]. However, not all COVID-19 patients exhibit these characteristics, making the differential diagnosis from other pulmonary diseases challenging [4,5]. In particular, the typical appearance of COVID-19 partially overlaps with the CT findings of other types of viral pneumonia, such as those caused by adenovirus and rhinovirus [6]. This reduces the specificity of chest CT and raises the risk of false-positive diagnoses, especially when the incidence and prevalence of COVID-19 are low [7].

To facilitate the evaluation of chest CT in patients with suspected lung involvement by SARS-CoV-2, the COVID-19 Reporting and Data System (CO-RADS) score was proposed [8]. This scheme provides a standardized five-point scale to express the suspicion of COVID-19 pneumonia on chest CT images, and has demonstrated excellent diagnostic performance and moderate-to-substantial interobserver agreement [9]. Nevertheless, CO-RADS category 3, which accounts for equivocal findings, still corresponds to a confirmed COVID-19 diagnosis in 20–40% of cases [8,10].

Given the need for efficient tools for the detection and differential diagnosis of COVID-19, there has been a considerable drive to develop solutions based on quantitative imaging, such as radiomics and artificial intelligence (AI) [11]. Many authors have pointed out the potential added value of AI models in differentiating COVID-19 from other types of pneumonia, with accuracy ranging from 80% to over 95% [12,13,14,15]. However, in most cases, the diagnostic performance of AI models was assessed by comparing COVID-19 with heterogeneous pulmonary conditions, often including bacterial infections [15,16,17], whose distinct features can ease the classification task.

In this work, we designed a multi-reader study to assess the performance of a radiomics-based AI classifier in the radiological challenge of discriminating COVID-19 from other types of viral-only pneumonia with microbiologically established etiology. We also simulated two distinct suspicion scenarios to investigate the impact of the varying epidemiological conditions on diagnostic performance.

2. Materials and Methods

2.1. Study Design and Imaging Data

This study was retrospectively conducted in a single high-volume referral hospital for the management of the COVID-19 pandemic. The Local Ethics Committee (decision number 188-22042020) approved the study and waived informed consent since data were collected retrospectively and processed anonymously.

Chest CT scans of 1031 consecutive patients with a positive PCR nasopharyngeal swab for SARS-CoV-2 (COVID-19, n = 647) or for other respiratory viruses (non-COVID-19, n = 384) were collected. The panel of non-COVID-19 viruses detected included: adenovirus, bocavirus 1/2/3/4, coronavirus 229E/NL63/OC43, enterovirus, influenza A/B viruses, metapneumovirus, parainfluenza virus 1/2/3/4, rhinovirus A/B/C, and respiratory syncytial virus A/B. Patients with evidence of bacterial coinfection in their clinical documentation were excluded.

The CT scans of COVID-19 patients were performed between March 2020 and April 2021, while CT scans of non-COVID-19 patients were performed between January 2015 and October 2019 (i.e., before SARS-CoV-2 started circulating). For both groups, the CT scans were acquired within 15 days of molecular evidence of infection.

Chest CT examinations were performed with different CT scanners (Somatom Definition Edge and Somatom Sensation 64, Siemens; Brilliance 64, Philips) and with the same patient set-up (supine position with arms over the head, during a single breath-hold whenever patient compliance allowed). The main acquisition parameters were: tube voltage = 80–140 kV; automatic tube current modulation; pitch = 1; matrix = 512 × 512. All acquisitions were reconstructed with high-resolution thorax kernels and a slice thickness of 3 mm.
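For orientation only, the following minimal Python sketch (not part of the original study pipeline) shows how such acquisition parameters could be read from DICOM headers with pydicom to verify consistency before image analysis; the file path and the checks are purely illustrative assumptions.

```python
# Illustrative sketch: verifying that a CT slice matches the acquisition settings
# described above. Paths and expected values are hypothetical.
import pydicom

def check_acquisition(dicom_path: str) -> dict:
    """Read key acquisition parameters from one CT slice header."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    return {
        "kvp": float(ds.KVP),                            # expected 80-140 kV
        "slice_thickness_mm": float(ds.SliceThickness),  # expected 3 mm
        "rows": int(ds.Rows),                            # expected 512
        "columns": int(ds.Columns),                      # expected 512
        "kernel": str(getattr(ds, "ConvolutionKernel", "unknown")),
    }

if __name__ == "__main__":
    info = check_acquisition("case_0001/slice_0001.dcm")  # hypothetical path
    assert 80 <= info["kvp"] <= 140, "tube voltage outside the expected range"
    assert info["rows"] == info["columns"] == 512, "unexpected matrix size"
    print(info)
```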

2.2. Artificial Intelligence-Based Model

The collected CT images were used to develop a radiomics-based Neural Network (R-AI) classifier exploiting a Multi-Layer Perceptron architecture to discriminate between COVID-19 and non-COVID-19 pneumonia. In particular, the classifier was trained with 811 CT scans (n = 496 COVID-19, n = 315 non-COVID-19), while the remaining 220 CT scans (n = 151 COVID-19, n = 69 non-COVID-19) were used as an independent validation dataset, applying a threshold of 0.5 to the predicted values. Details about the R-AI classifier, including development and tuning, were previously described [18].

The R-AI classifier provided as output the probability (0.00–1.00) that the analyzed CT scan belonged to a COVID-19 patient.
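As an illustration of this general logic, the Python sketch below trains a Multi-Layer Perceptron on a radiomic feature matrix and applies a 0.5 threshold to the predicted probability. It is not the authors' implementation: the architecture, feature dimensionality, and placeholder data are assumptions.

```python
# Minimal sketch of a radiomics-based MLP classifier analogous (not identical) to
# the R-AI model described above. Placeholder data stand in for radiomic features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# X: radiomic feature matrices; y: 1 = COVID-19, 0 = non-COVID-19 (placeholder data)
X_train, y_train = rng.normal(size=(811, 100)), rng.integers(0, 2, 811)
X_val, y_val = rng.normal(size=(220, 100)), rng.integers(0, 2, 220)

model = make_pipeline(
    StandardScaler(),                            # radiomic features benefit from scaling
    MLPClassifier(hidden_layer_sizes=(64, 32),   # assumed architecture
                  max_iter=500, random_state=0),
)
model.fit(X_train, y_train)

# The classifier outputs the probability (0.00-1.00) that a scan belongs to a
# COVID-19 patient; a 0.5 threshold yields the binary prediction.
prob_covid = model.predict_proba(X_val)[:, 1]
pred_covid = (prob_covid >= 0.5).astype(int)
print("Validation accuracy on placeholder data:", (pred_covid == y_val).mean())
```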

2.3. Reader Evaluation

Three radiologists with >10 years of experience (Readers 1–3) and one radiology resident with 3 years of experience (Reader 4), all employed at a high-volume COVID-19 referral hospital, were enrolled to evaluate the 220 CT scans of the independent validation dataset. The four readers were blinded to the original radiologic report and to all non-imaging data, including the acquisition date of the CT scans. They were asked to assign each case a CO-RADS score [8] (from 1 to 5) to express the increasing suspicion of COVID-19. To properly simulate a realistic clinical scenario, the readers were instructed to interpret the CT findings assuming that the patients had an acute condition (e.g., presentation at the Emergency Department).

Additionally, as an estimate of disease severity, for each patient, the readers visually assessed the extent of pulmonary involvement expressed as a percentage of the total lung volume, rounded to the nearest 10%.

The test was performed using a program developed in JavaScript that automatically presented the anonymized CT series to the reader in random order. After the reader had assigned the CO-RADS score through a dialog box, the program automatically loaded the CT of the next patient.
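A minimal Python sketch of this randomization and blinding logic is given below for illustration only; the actual tool was written in JavaScript, and the case identifiers, reader names, and console input used here are hypothetical stand-ins for the viewer and dialog box.

```python
# Illustrative sketch of the blinded, randomized reading workflow described above.
import random

def run_reading_session(case_ids, reader_name, seed=None):
    """Present anonymized cases in random order and collect CO-RADS scores (1-5)."""
    order = list(case_ids)
    random.Random(seed).shuffle(order)   # random presentation order for this reader
    scores = {}
    for case_id in order:
        # In the real program, the CT series is displayed here and the score is
        # entered through a dialog box; input() stands in for that dialog.
        score = int(input(f"[{reader_name}] CO-RADS (1-5) for case {case_id}: "))
        assert 1 <= score <= 5, "CO-RADS must be between 1 and 5"
        scores[case_id] = score
    return scores
```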

2.4. Data Analysis

Continuous variables were reported as median values with the 25th and 75th percentiles (Q1–Q3) of their distribution; categorical variables were expressed as counts and percentages, with the corresponding 95% confidence interval (95%CI) computed using the Wilson method [19].

The chance-corrected inter-reader agreement for the assigned CO-RADS score was tested using Gwet's second-order agreement coefficient (AC2) with ordinal weights [20]. AC2 was chosen to correct for the partial agreement occurring when comparing ordinal variables with multiple readers and because it is less affected by prevalence and marginal distribution [21,22,23]. The level of agreement was interpreted following Altman's guidelines [24]. The weighted percentage agreement was reported as well [25].
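For clarity, the Python sketch below illustrates two of the descriptive statistics named above: the Wilson score interval for a proportion and a weighted percentage agreement between readers. Linear ordinal weights are used here as a simplification; the study used Gwet's AC2 with ordinal weights, for which a dedicated implementation is preferable. The example scores are hypothetical.

```python
# Sketch of the Wilson 95% CI and a pairwise weighted percentage agreement.
import numpy as np

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def weighted_percent_agreement(scores: np.ndarray, n_categories: int = 5) -> float:
    """Mean pairwise agreement between readers, weighted by ordinal distance.
    scores: (n_subjects, n_readers) array of CO-RADS scores in 1..n_categories."""
    n_subjects, n_readers = scores.shape
    agreements = []
    for i in range(n_readers):
        for j in range(i + 1, n_readers):
            dist = np.abs(scores[:, i] - scores[:, j])
            weight = 1 - dist / (n_categories - 1)  # 1 = identical, 0 = maximal disagreement
            agreements.append(weight.mean())
    return float(np.mean(agreements))

# Hypothetical CO-RADS scores from 4 readers on 6 patients
demo = np.array([[5, 5, 4, 5], [3, 2, 3, 3], [1, 1, 2, 1],
                 [4, 5, 5, 4], [2, 3, 2, 2], [5, 4, 5, 5]])
print(wilson_ci(successes=37, n=220))
print(weighted_percent_agreement(demo))
```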

To account for equivocal results (i.e., CO-RADS 3), two different scenarios were simulated: a high suspicion scenario, where CO-RADS 3 results were considered as COVID-19 patients, and a low suspicion scenario, where CO-RADS 3 results were considered as non-COVID-19 patients together with CO-RADS 1 and 2.

Sensitivity (SE), specificity (SP), accuracy (ACC), positive likelihood ratio (PLR), and negative likelihood ratio (NLR) of human readers in discriminating COVID-19 patients from non-COVID-19 patients were calculated for both high and low suspicion scenarios. The same metrics of diagnostic performance were also calculated for the R-AI classifier.
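The scenario-based dichotomization of the CO-RADS scores and the derived metrics can be summarized by the following Python sketch; the CO-RADS scores and ground-truth labels in the example are hypothetical and chosen only for illustration.

```python
# Sketch: dichotomize CO-RADS per suspicion scenario, then compute SE, SP, ACC, PLR, NLR.
import numpy as np

def dichotomize(co_rads: np.ndarray, scenario: str) -> np.ndarray:
    """High suspicion: CO-RADS 3-5 counted as COVID-19; low suspicion: CO-RADS 4-5 only."""
    threshold = 3 if scenario == "high" else 4
    return (co_rads >= threshold).astype(int)

def diagnostic_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    tp = int(np.sum((pred == 1) & (truth == 1)))
    tn = int(np.sum((pred == 0) & (truth == 0)))
    fp = int(np.sum((pred == 1) & (truth == 0)))
    fn = int(np.sum((pred == 0) & (truth == 1)))
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {
        "SE": se,
        "SP": sp,
        "ACC": (tp + tn) / len(truth),
        "PLR": se / (1 - sp),        # positive likelihood ratio
        "NLR": (1 - se) / sp,        # negative likelihood ratio
    }

# Hypothetical CO-RADS scores and ground truth (1 = COVID-19)
co_rads = np.array([5, 4, 3, 3, 2, 1, 5, 3, 4, 4])
truth   = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])
for scenario in ("high", "low"):
    print(scenario, diagnostic_metrics(dichotomize(co_rads, scenario), truth))
```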

Moreover, a further subanalysis was conducted to compare the performance of the human readers and the R-AI classifier in challenging cases, defined as those in which two or more readers had assigned a CO-RADS 3 score.

Significant differences in the diagnostic performance of the readers and the R-AI classifier were tested using Cochran’s Q test with a post-hoc pairwise McNemar test.
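A possible implementation of this comparative analysis is sketched below using statsmodels; the binary correctness matrix is simulated and does not reproduce the study data.

```python
# Sketch: Cochran's Q test across readers and the R-AI classifier, followed by
# post-hoc pairwise McNemar tests with Bonferroni correction (simulated data).
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(1)
# correct[i, j] = 1 if rater j (Readers 1-4 and R-AI) classified patient i correctly
correct = rng.integers(0, 2, size=(220, 5))   # placeholder data
labels = ["Reader 1", "Reader 2", "Reader 3", "Reader 4", "R-AI"]

q_res = cochrans_q(correct)
print(f"Cochran's Q = {q_res.statistic:.2f}, p = {q_res.pvalue:.3f}")

# Post-hoc pairwise McNemar tests (meaningful only if Cochran's Q is significant)
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
alpha_corrected = 0.05 / len(pairs)            # Bonferroni correction
for i, j in pairs:
    both_right = np.sum((correct[:, i] == 1) & (correct[:, j] == 1))
    i_only = np.sum((correct[:, i] == 1) & (correct[:, j] == 0))
    j_only = np.sum((correct[:, i] == 0) & (correct[:, j] == 1))
    both_wrong = np.sum((correct[:, i] == 0) & (correct[:, j] == 0))
    p = mcnemar([[both_right, i_only], [j_only, both_wrong]], exact=True).pvalue
    flag = "significant" if p < alpha_corrected else "ns"
    print(f"{labels[i]} vs {labels[j]}: p = {p:.3f} ({flag})")
```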

The data analysis was performed using the Real Statistics Resource Pack software (Release 6.8) (www.real-statistics.com (accessed on 1 October 2022)) for Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) and GraphPad Prism 8.4.0 (GraphPad Software, La Jolla, CA, USA).

Statistical significance was established at the p < 0.050 level, applying Bonferroni’s correction for multiple comparisons when appropriate.

3. Results

The demographic characteristics of the patient population are reported in Table 1. Specifically, the 220 patients of the independent validation set consisted of 159 (72%) males and 61 (28%) females and had a median age of 68 (Q1–Q3: 59–78) years. Averaging across the different readers, the median extent of pulmonary disease was 33% (Q1–Q3: 20–53%) of the total lung volume. The median interval between CT scans and molecular swabs was 1 (Q1–Q3: 0–2) days for COVID-19 patients and 3 (Q1–Q3: 1–6) days for non-COVID-19 patients.

The CO-RADS scores assigned by each reader are detailed in Table 2. Considering the global performance of the four readers, the error rate was 17% (95%CI: 14–20%) in classifying patients as COVID-19 and 32% (95%CI: 27–38%) in classifying them as non-COVID-19. Notably, some discrepancies could be observed, since Reader 3 tended to assign a CO-RADS 2 score more frequently in both the COVID-19 and non-COVID-19 groups compared to the other readers. However, inter-reader agreement in assigning the CO-RADS score was good, with an ordinal-weighted AC2 of 0.71 (95%CI: 0.67–0.76). The rate of patients classified as CO-RADS 1 (normal/noninfectious) was 10% (95%CI: 8–12%), while the rate of CO-RADS 3 (equivocal cases) was 17% (95%CI: 15–20%). Specifically, 43 (20%) cases received a CO-RADS 3 score from two or more readers, of which 26 (60%) were COVID-19 patients and 17 (40%) were non-COVID-19 patients. On the other hand, the R-AI classifier misclassified 21% (95%CI: 15–28%) of the COVID-19 patients and 22% (95%CI: 14–33%) of the non-COVID-19 patients. Exemplary cases are shown in Figure 1.

Regarding the diagnostic performance in identifying COVID-19 pneumonia, full results are provided in Table 3, Figure 2 and Figure 3. Considering all the readers, SE = 83% (95%CI: 80–86%), SP = 66% (95%CI: 60–71%), ACC = 78% (95%CI: 75–80%), PLR = 2.35 (95%CI: 2.00–2.76), and NLR = 0.25 (95%CI: 0.21–0.30) were observed in the high suspicion scenario. On the other hand, SE = 68% (95%CI: 64–72%), SP = 88% (95%CI: 84–92%), ACC = 75% (95%CI: 72–77%), PLR = 5.70 (95%CI: 4.12–7.89), and NLR = 0.36 (95%CI: 0.32–0.41) were obtained in the low suspicion scenario.

When considering the R-AI classifier, it achieved SE = 79% (95%CI: 71–85%), SP = 78% (95%CI: 67–87%), ACC = 79% (95%CI: 73–84%), PLR = 3.63 (95%CI: 2.30–5.72), and NLR = 0.27 (95%CI: 0.19–0.38) in distinguishing COVID-19 from non-COVID-19 pneumonia on the validation dataset.

According to Cochran's Q test, only the performance of Reader 3 significantly changed between the high and low suspicion scenarios, decreasing in the latter (accuracy 70% vs. 78%, p = 0.008); no significant changes were found for the other readers (p > 0.999). No significant differences in performance were observed between the readers and the R-AI classifier in the high suspicion scenario (p = 0.369); on the contrary, a statistically significant result was obtained in the low suspicion scenario (p = 0.003). However, the post-hoc pairwise McNemar test revealed that the R-AI classifier still had diagnostic performance comparable to that of the human readers (lowest p = 0.256), whereas Reader 3 had a significantly lower performance than Reader 2 (p = 0.039) and Reader 4 (p = 0.041). Full statistical results of the comparative analysis are provided in Table 4.

Finally, considering the subset of 43 CT scans to which two or more radiologists assigned a CO-RADS 3 score, the readers obtained a global accuracy of 55% (95%CI: 47–62%) in the high suspicion scenario and 45% (95%CI: 38–53%) in the low suspicion scenario, whereas the R-AI classifier showed an accuracy of 74% (95%CI: 59–86%). Cochran's Q test was significant in both scenarios; at the post-hoc pairwise McNemar test, the R-AI classifier significantly outperformed one of the readers in both scenarios (p = 0.023 for both) and Reader 3 in the low suspicion scenario (p = 0.035). Full details are reported in Table 5 and Table 6 and Figure 4.

4. Discussion

In this study, the diagnostic performance of multiple readers in distinguishing between COVID-19 and non-COVID-19 pneumonia was evaluated in two different risk scenarios and compared with a radiomic-based artificial intelligence classifier.

Given the well-known complexity of the task, inter-reader agreement in assigning the CO-RADS score was assessed and found to be good, in line with the currently available literature on the reproducibility of this reporting system. Prokop et al. [8] initially observed an overall Fleiss' kappa of 0.47, but subsequent studies reported a moderate-to-good level of agreement, comparable to that observed in our study [9,26]. Moreover, the absence of significant differences in the diagnostic performance of the three high-experience readers compared to the low-experience reader using the CO-RADS score confirmed the observations by Bellini et al. [10]. On the contrary, in our study, some inconsistency in CO-RADS evaluation was observed for one of the high-experience readers, whose diagnostic accuracy was slightly lower in the low suspicion scenario.

At the very beginning of the COVID-19 pandemic, a study [7] on 424 patients with COVID-19 and non-COVID-19 viral pneumonia yielded a classification accuracy ranging between 60% and 83% for radiologists with direct experience of SARS-CoV-2 infection. A similarly wide range of accuracy was reported in subsequent multi-reader analyses [5,27,28], and the results of our study fell within it. The simulation of two different suspicion scenarios allowed us to account for diverse epidemiological conditions, thus providing a more complete picture of the diagnostic performance of the readers.

When applied to the same dataset, the R-AI classifier obtained an accuracy of 79%, comparable to the performance of the human readers in both the high and low suspicion scenarios. This result was similar to that reported by Cardobi et al. [29], who developed a radiomics-based model to distinguish COVID-19 from other types of interstitial pneumonia on chest CT. As we used a roughly tenfold larger dataset and applied the R-AI classifier to an independent validation set, our study provides stronger evidence that quantitative imaging and AI models can support this diagnostic task.

Notably, when considering only the subset of patients who were assigned a CO-RADS 3 score by two or more radiologists, the global accuracy of the human readers dropped to 45–55% (depending on the scenario), while the accuracy of the R-AI classifier was almost unchanged (74%). This suggests a more stable performance of the AI, probably based on the extraction of quantitative information from medical images that is not perceivable by the human eye, even though the result was only partially confirmed by the post-hoc pairwise McNemar test. However, it is reasonable to believe that the smaller sample size, resulting in larger confidence intervals for the performance metrics, together with the correction for multiple comparisons, reduced the statistical power and increased the risk of type II errors. Nevertheless, the result bolsters the concept of AI models helping with equivocal cases, for example, as a second-opinion tool to improve diagnostic performance.

AI models with higher performance than our classifier in differentiating between COVID-19 and non-COVID-19 viral pneumonia have also been reported, as in the studies by Wang et al. [14,30]. However, these authors proposed a method based on single-slice manual segmentation of pulmonary lesions, which is a time-consuming approach hardly feasible in everyday clinical practice compared with our fully automatic approach. Zhou et al. [13] provided another example of an automatic deep learning-based algorithm with very good performance, but it was limited to patients with SARS-CoV-2 and influenza virus infections.

In this regard, contrary to many other similar studies on AI models [16,17], we decided to focus only on the differential diagnosis between COVID-19 and non-COVID-19 viral pneumonia, rather than on a broader spectrum of pulmonary diseases. On the one hand, this choice was meant to stress the difficulty posed by the highly overlapping CT findings of these entities; on the other hand, the recognition of typical signs of bacterial infection, such as lobar consolidation, would most likely not require the support of an AI tool. In addition, even if rapid COVID-19 tests are currently widespread and help guide the clinical suspicion, they may be unavailable in some contexts (e.g., night shifts) or provide equivocal results. We therefore envisioned our R-AI classifier as a tool for the radiologist to be used in pneumonia cases whose infectious nature is recognized but whose findings are ambiguous or discordant with the clinical history or laboratory results. Nevertheless, in the future, it would be possible to further train the classifier on other lung diseases that mimic COVID-19, such as organizing pneumonia or drug-induced interstitial pneumonia, thus extending its applications.

The main limitation of this study is its retrospective design in a single institution, which entails a potential selection bias. For example, the COVID-19 and non-COVID-19 groups had different sample sizes, although the impact of this imbalance was limited by the fact that the readers were unaware of the case proportions. The R-AI classifier was trained and tested on a COVID-predominant dataset as well. Additionally, the study population mainly included patients with moderate-to-severe pulmonary involvement based on the visual evaluation of the readers. The underrepresentation of cases with mild disease could represent a bias, even if the sample reflects the actual population for whom a chest CT scan is recommended [31]. In addition, the CO-RADS score was developed specifically for use in patients with moderate-to-severe disease [8]. Another limitation is that chest CT scans acquired within 15 days of molecular evidence of infection were used, so the causal relationship between the detected virus and the imaging findings cannot be guaranteed; indeed, some of the selected patients may have had mixed pneumonia or other diseases. However, the large dataset used should have minimized the impact of this occurrence. Finally, the radiologists were not given clinical information during the evaluation, which could have further improved their performance. In the future, the generalizability of our results should be assessed with a prospective design in a multicenter setting, possibly incorporating clinical information into the AI model.

In conclusion, this work confirmed that distinguishing COVID-19 from other types of viral pneumonia is challenging, even for expert radiologists. Nevertheless, we showed that an artificial intelligence classifier based on radiomic features can provide diagnostic performance comparable to that of human readers in this task, and possibly even better performance in equivocal cases. Once implemented in the clinical workflow, such a tool could support radiological activity, for example, by providing a second opinion in case of ambiguous chest CT findings of pulmonary infection.
