Upskilling or deskilling? Measurable role of an AI-supported training for radiology residents: a lesson from the pandemic

AI-supported task and tool

During the first pandemic outbreak, Radiology Unit 2 of Azienda Socio Sanitaria Territoriale ‘Spedali Civili di Brescia,’ one of the largest in Italy, introduced into its clinical routine a scoring system for chest X-ray (CXR) images, referred to as the Brixia score [30], for the assessment of lung impairment in hospitalized patients with COVID-19. This is a multi-regional score: the lungs are divided into six regions, and the referring radiologist assigns each region an integer score from 0 to 3, based on a localized visual assessment of the severity of lung compromise, as shown in Fig. 1 (left); the six regional scores add up to a global score ranging from 0 to 18.

Fig. 1

(Left) Brixia score: (a) zone definition and (b–d) examples of annotations. Lungs are first divided into six zones on frontal chest X-rays. Line A is drawn at the level of the inferior wall of the aortic arch. Line B is drawn at the level of the inferior wall of the right inferior pulmonary vein. A and D: upper zones; B and E: middle zones; C and F: lower zones. A score ranging from 0 (green) to 3 (black) is then assigned to each sector, based on the observed lung abnormalities. (Right) User interface: the CXR is on the left, while on the right a column contains the image controls (windowing) and the AI support, consisting of the explainability map, the regional scores with confidence levels that pop up on mouse-over, and the label boxes where the radiologist places his/her final score
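
As a concrete illustration of the scoring scheme, the minimal sketch below (our illustration, not the authors' code; the function name is hypothetical) aggregates the six regional scores into the global Brixia score, i.e., their sum, ranging from 0 to 18.

```python
# Illustrative sketch (not the authors' code): aggregating the six regional
# Brixia scores, each an integer from 0 to 3, into the global score (0-18).
from typing import Sequence


def brixia_global_score(regional_scores: Sequence[int]) -> int:
    """Sum the six regional scores (zones A-F, cf. Fig. 1) into the global score."""
    if len(regional_scores) != 6:
        raise ValueError("Expected exactly six regional scores (zones A-F).")
    if any(s not in (0, 1, 2, 3) for s in regional_scores):
        raise ValueError("Each regional score must be an integer in {0, 1, 2, 3}.")
    return sum(regional_scores)


# Example: six zone scores as defined in Fig. 1
print(brixia_global_score([1, 2, 3, 0, 2, 2]))  # -> 10
```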

Thanks to the work of the hospital’s radiology staff of around 50 radiologists, it was possible to collect a dataset of 5000 CXRs in just one pandemic month, in the spring of 2020. Trained on this dataset, BS-Net has demonstrated the ability to robustly predict Brixia score values on images representative of the full clinical reality and complexity: different acquisition modalities (computed radiography and digital radiography) and manufacturers, different acquisition projections (anteroposterior and posteroanterior), and different patient conditions (e.g., standing or supine, with or without life-support systems). On a curated ‘consensus-based gold standard’ (CbGS) portion of the test dataset, rated by five radiologists, BS-Net surpassed the radiologists in terms of prediction accuracy and inter-rater variability (see [29] for details).

BS-Net was also experimentally deployed on the hospital RIS/PACS, where a user interface was designed to provide AI-based support to radiologists in assigning the Brixia score. This interface graphically presents confidence levels for each AI-assigned partial score in every lung region, along with an explainability map that offers insights into how BS-Net operates in such a complex task (Fig. 1, right).

Participants and data

Leveraging the implemented version of BS-Net, we analyzed the learning effects on residents resulting from the use of this integrated AI support system in clinical activity. The participants in our study were eight radiologists in training, four in their first year and four in their third year, with whom we explored two main aspects. First, we investigated how different levels of integration between the AI tool and residents could potentially improve their scoring performance. Second, we assessed the ability of the residents to take control in the critical event of machine failure. Specifically, we evaluated the clinical performance of the residents in three distinct scenarios:

no-AI: the RIS/PACS interface presents only the CXR for Brixia score grading without any AI assistance;

on-demand-AI: the interface initially displays only the CXR, but AI support can be accessed by simply clicking a button on the RIS. Once accessed, a panel in the RIS displays the Brixia scores assigned by the AI support to each of the six lung regions, together with the confidence level for each score and the corresponding explainability map;

integrated-AI: the RIS/PACS interface simultaneously displays the CXR and all information provided by the AI support system (confidence values and map).

Each resident evaluated a set of 50 images for each scenario, covering the total of 150 images belonging to the CbGS dataset annotated by the five board-certified radiologists [29]. This dataset was partitioned into three fixed subsets, one for each scenario, so that all residents operated on the same image data. The images for each subset were randomly selected to ensure balance across variables such as age, (global) Brixia score, gender, and the five radiologists’ average score error with respect to the consensus score of the CbGS dataset, as shown in Table 1.

Table 1 Baseline characteristics of the three examined dataset blocks (50 images each, for a total of 150 CXRs), reported as mean and standard deviation values
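
The exact partitioning procedure is not detailed beyond the balancing criteria above; the sketch below shows one plausible way to obtain three balanced 50-image blocks (it is our illustration, not the authors' code, and the column names, tolerance, and re-shuffling strategy are assumptions). A categorical variable such as gender could be checked analogously via per-block proportions.

```python
# Illustrative sketch: randomly re-shuffle the 150 CbGS images into three
# 50-image blocks until the per-block means of the numeric balancing
# variables stay close to the overall means.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def partition_balanced(df: pd.DataFrame,
                       cols=("age", "global_score", "avg_rater_error"),
                       tol: float = 0.1, max_tries: int = 10_000):
    """Return three 50-row blocks whose standardized column means deviate
    from the overall mean by less than `tol` (in SD units)."""
    z = (df[list(cols)] - df[list(cols)].mean()) / df[list(cols)].std()
    for _ in range(max_tries):
        idx = rng.permutation(len(df))
        blocks = [idx[0:50], idx[50:100], idx[100:150]]
        if all(z.iloc[b].mean().abs().max() < tol for b in blocks):
            return [df.iloc[b] for b in blocks]
    raise RuntimeError("No sufficiently balanced partition found; relax `tol`.")
```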

Before starting the tests, the participating residents received training on the AI system, during which they were shown examples generated by the same AI model used in the study.

Metrics

Given the reported positive impact of BS-Net on CXR scoring in terms of increased inter-rater agreement (see [29]), for each scenario considered here (no-AI, on-demand-AI, integrated-AI) we first calculated the mean absolute error (MAE) and standard deviation (SD) of the scoring errors computed against the CbGS. Additionally, we assessed all scenarios in terms of inter-rater agreement among the eight residents when evaluating the Brixia score on the test image set; among the common indices of inter-rater agreement, we used the intraclass correlation coefficient (ICC).
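
A minimal sketch of the error metrics follows (our illustration only; whether errors are aggregated per region or at the global-score level, and whether the SD refers to signed or absolute errors, follows [29] and is assumed here to be absolute global-score errors).

```python
# Illustrative sketch: MAE and SD of the scoring errors of one resident
# against the consensus-based gold standard (CbGS) for one scenario block.
import numpy as np


def scoring_errors(resident_scores, consensus_scores):
    """Both inputs hold one global Brixia score per image of a scenario block.
    Returns (MAE, sample SD of the absolute errors)."""
    errors = np.abs(np.asarray(resident_scores) - np.asarray(consensus_scores))
    return errors.mean(), errors.std(ddof=1)
```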

To evaluate potential situations that could lead to deskilling and, in the long run, undermine the continuous development of operators’ skills, we observed residents’ behavior when AI support failed. Specifically, in the two AI-supported scenarios, we assessed whether in situations of machine failure, i.e., when the AI’s prediction error exceeded an agreed threshold of ‘acceptability,’ residents uncritically relied on the AI’s suggestions or, on the contrary, demonstrated resilience to incorrect predictions.

Five expert radiologists, each with experience of hundreds of cases scored with this semi-quantitative rating system, agreed on ± 0.5 as an acceptable error for each region (where the score ranges from 0 to 3) and on ± 2 as an acceptable error for the global score (ranging from 0 to 18). These indications were supported by both quantitative and clinical observations. First, the ± 2 threshold was confirmed by the numerical assessment of the errors made by the same radiologists while scoring the CbGS, as reported in Table 1. Second, this threshold was supported by the prognostic value and associated use of the score as a severity indicator, derived from experimental evidence and clinical observations during the initial period of its application [33].
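
One plausible operationalization of these thresholds is sketched below (our illustration; the exact rule combining regional and global errors is not specified in the text, and the array layout is an assumption).

```python
# Illustrative sketch: flag CXRs on which the AI support "failed", i.e., where
# its prediction error with respect to the reference exceeds the agreed
# acceptability thresholds (+-0.5 per region, +-2 on the global score).
import numpy as np


def flag_machine_failures(ai_regional: np.ndarray, reference_regional: np.ndarray,
                          regional_tol: float = 0.5, global_tol: float = 2.0):
    """ai_regional, reference_regional: (n_images, 6) arrays of regional scores.
    Returns a boolean mask marking images where any regional error exceeds
    regional_tol or the global-score error exceeds global_tol."""
    regional_err = np.abs(ai_regional - reference_regional)
    global_err = np.abs(ai_regional.sum(axis=1) - reference_regional.sum(axis=1))
    return (regional_err > regional_tol).any(axis=1) | (global_err > global_tol)
```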

After the experimental activity, we asked the trainee radiologists to report their opinions and experiences through a questionnaire organized along four axes: agreement, usefulness, trustworthiness, and future use. The full questionnaire, which used a 7-point Likert scale, is presented in Table 2. Additional multiple-choice questions regarding the scoring experience are shown in Table 3.

Table 2 Questionnaire posed to the residents after scoring

Table 3 Multiple-choice questions posed to the residents after scoring

In the Results section, we present the results obtained in the three tested scenarios: no-AI, on-demand-AI, and integrated-AI. We first report some observations and measures concerning the on-demand-AI scenario, which, unlike the other scenarios, has the peculiarity of being activated at the voluntary request of the radiologist. We then present measures of the potential benefits related to the active use of AI techniques by residents during their training activities in the two AI-assisted scenarios, comparing both with the no-AI case. Considering the machine-failure situations in the two AI-supported scenarios, we assess the residents’ resilience to the AI’s erroneous predictions. Finally, we report the responses to the questionnaires aimed at evaluating the overall scoring experience in the AI-supported scenarios.

Statistical methods

Data analysis was conducted using Python 3.10 with the libraries statsmodels 0.14.2 and scipy 1.13.0. Descriptive statistics, including means and standard deviations, were used to summarize resident demographics and scoring characteristics. To assess the impact of AI assistance on scoring accuracy, the MAE and SD of scoring errors were computed for each scenario (no-AI, on-demand-AI, integrated-AI) using the consensus-based gold standard (CbGS) as the reference. Differences in MAE between scenarios were evaluated using paired t-tests, with a significance level of p < 0.001. The Bonferroni correction was applied to adjust the significance level to account for multiple comparisons. Inter-rater agreement among the eight residents in each scenario was quantified using the intraclass correlation coefficient (ICC) [34]. Specifically, ICC-1 was employed to assess the absolute agreement among residents, assuming that they represent a random sample from a larger population of residents. ICC-1 values range from 0 to 1, with higher values indicating stronger agreement.
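
A compact sketch of this analysis pipeline follows (our illustration, not the authors' script; the data layout is assumed, and, since the text leaves it ambiguous whether the p < 0.001 threshold is before or after the Bonferroni adjustment, the adjustment is applied to the alpha here).

```python
# Illustrative sketch: paired t-test on per-image absolute errors between two
# scenarios (Bonferroni-adjusted alpha for three pairwise comparisons) and a
# one-way random-effects ICC (ICC-1) for inter-rater agreement.
import numpy as np
from scipy import stats


def compare_scenarios(errors_a, errors_b, alpha=0.001, n_comparisons=3):
    """Paired t-test on matched per-image absolute errors of two scenarios."""
    res = stats.ttest_rel(errors_a, errors_b)
    return res.statistic, res.pvalue, res.pvalue < alpha / n_comparisons


def icc1(scores: np.ndarray) -> float:
    """ICC-1 (one-way random effects, single rater).
    scores: (n_images, n_raters) matrix, here the 8 residents of one scenario."""
    n, k = scores.shape
    grand_mean = scores.mean()
    image_means = scores.mean(axis=1)
    msb = k * ((image_means - grand_mean) ** 2).sum() / (n - 1)         # between images
    msw = ((scores - image_means[:, None]) ** 2).sum() / (n * (k - 1))  # within images
    return (msb - msw) / (msb + (k - 1) * msw)
```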
