Reproducibility of artificial intelligence models in computed tomography of the head: a quantitative analysis

We initially conducted a systematic literature search following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [12]. The risk of bias assessment was done via the Grading of Recommendations Assessment, Development and Evaluation (GRADE) tool [13]. A PRISMA checklist can be found in Additional file 1. This phase started on March 1, 2021, using the PubMed, Cochrane, and Web of Science databases. On PubMed, we used the following Boolean search terms with Medical Subject Headings (MeSH) for further stratification: “(artificial intelligence [mh] OR neural network [mh] OR machine learning [mh]) & (computed tomography [mh] OR CT [mh]) & (neuroimaging [mh] OR brain imaging [mh] OR brain [mh] OR head [mh])”. We carried out two literature search phases: a first search including articles from January 1, 2000, to December 31, 2020, and a second, updated search including articles between January 1, 2021, and November 1, 2021. This period was chosen because an initial search showed that the first articles on head CT relevant to this review were published after 2000 and, more importantly, that publication activity increased mainly in the last 7 years. A PRISMA diagram is shown in Fig. 1.

Fig. 1 PRISMA workflow. The PRISMA workflow chart explains the selection of items to be analyzed

We only selected full-length articles for review that matched the following inclusion criteria: original research article; English language; proposing a machine learning model of any kind; computed tomography of the brain; involved human participants. We preliminarily excluded publications when their abstracts, read by two independent readers (G.R., F.G.), did not include research on machine learning models, or when the papers had a SCImago Journal Rank [14] below 2, which we considered a quality cut-off (initial screening phase). We found that many publications in journals below this rank do not consistently follow the recommendations for good scientific practice. We further excluded articles not involving computed tomography scans of the brain, articles that were not peer-reviewed, and other publication types (such as letters to the editor, conference abstracts, commentaries, case reports or other systematic reviews). The collection of data and the exclusions were determined by two readers (F.G., G.R.) and independently confirmed by two separate readers, one an expert in computational science and artificial intelligence (M.J.) and one a certified, experienced radiologist (E.M.H.). Machine learning was defined as any method in which a computer algorithm is trained with data sets to solve a specific problem automatically, in this study radiology-related problems.

Article selection

All full-length articles of the final selection (n = 83) were each read by one reader (F.G.), and information was manually entered into a data frame in Excel (Office 365). A list of all variables and outcomes for which data were sought can be found in Additional file 1: Table S3. Sources of information on our epidemiological data are referenced in Additional file 1: Table S2. We analyzed research articles on algorithms with classification, detection, segmentation, or prediction tasks as main functions, where the demographic characteristics were listed and the binary outcomes (pathology “A” or not pathology “A”) could be compared with real-world prevalence rates.

Statistical analysis

We extracted our data into Microsoft Excel and performed the statistical analysis with an R script (RStudio 4.1.1). We used descriptive statistics and Welch's t-test to assess the balancing of datasets (see Table 1). Our data sets and code can be found in the repository [15] or in Additional file 1. For the measurement of performance statistics, we classified all articles by their main functionality (classification, detection, segmentation, prediction, triage, reconstruction, generation, fusion). Features of machine learning models were considered complete when the authors defined them sufficiently for reproducibility, meaning that the individual parameters and features were either accessible in open-source code or clearly listed in the publications themselves (or in the supplements or prior studies). The mere mention of parameters was not considered sufficient (e.g., "the dataset was split into a training set and a validation set" vs. "the dataset was split 80/20 into a training set and a validation set").
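As a minimal sketch of this analysis step (not the authors' published script, which is available in the repository [15]), assuming a hypothetical spreadsheet with columns train_prevalence and epi_prevalence, the comparison could look as follows in R; note that t.test() applies the Welch correction by default:

```r
# Minimal sketch, not the authors' published script: descriptive statistics and
# Welch's t-test comparing the prevalence of the target pathology in training
# sets against real-world prevalence rates. The file name and the columns
# `train_prevalence` and `epi_prevalence` are hypothetical.
library(readxl)

df <- read_excel("extracted_articles.xlsx")

summary(df$train_prevalence)
summary(df$epi_prevalence)

# t.test() applies the Welch correction (unequal variances) by default
t.test(df$train_prevalence, df$epi_prevalence)
```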

Table 1 Balancing of training and test sets compared to real-world epidemiology

Development of research field

The literature search yielded 253 entries that met our search criteria. We excluded 93 records in our initial screening phase, keeping 160 full-text articles that were assessed for eligibility. After this assessment, 83 studies remained for review. The whole selection process is shown in Fig. 1.

The number of publications has increased since 2013 with an annual growth rate of 20%. The number of publications per year starting from 2000 is shown in Fig. 2. In most studies, the main purpose of the AI model was the prediction (n = 19) of specific events, most often the occurrence of intracranial lesions, outcomes after interventions, or other pathologies of the brain. Other tasks included brain segmentation (n = 15), generation of synthetic CT images (n = 13), detection of lesions (n = 10), image reconstruction (n = 9), classification tasks (n = 8), fusion of MRI scans with CT images (n = 5), triage models (n = 4) and automatic image registration into standardized spaces (n = 2).

Fig. 2 Number of publications per year and function. The number of publications grew from around 2013 until 2020. The decrease in 2021 can be explained by the early end date of our review in November 2021, as probably not all papers from 2021 had been published yet

Transparency and sources of data and code

Only a minority of authors provided open-source code (10.15%, n = 7). The data sets were mostly acquired from single-center sources (81.9%, n = 68). Authors described the following steps before training: augmentation steps (36.2%, n = 30), the resolution of input scans (72.3%, n = 60), the window center and width of Hounsfield units or the color space (32.5%, n = 27), and preprocessing steps (63.9%, n = 53).
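To make the windowing item concrete, the following is an illustrative sketch (not taken from any reviewed study) of how a window defined by its center and width, here an assumed brain window of 40 HU / 80 HU, maps raw Hounsfield units to a normalized intensity range:

```r
# Illustrative sketch of the windowing step whose reporting was assessed:
# Hounsfield units are clipped to a window given by its center and width
# (here an assumed brain window of center 40 HU / width 80 HU) and rescaled
# to [0, 1]. The window values are examples, not taken from any reviewed study.
apply_ct_window <- function(hu, center = 40, width = 80) {
  lower <- center - width / 2
  upper <- center + width / 2
  clipped <- pmin(pmax(hu, lower), upper)
  (clipped - lower) / (upper - lower)
}

# Example on a small vector of raw HU values
apply_ct_window(c(-1000, 0, 35, 80, 500))
```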

Balancing of datasets compared with epidemiology

We analyzed all articles where prevalence rates could be applied to the data sets (n = 30). We found a mean prevalence rate of 50% (SD ± 31%) in the training sets and a mean rate of 47% (SD ± 30%) in the test sets. This differed from real-world epidemiologic data, where prevalence rates reached a mean of 22% (SD ± 28%). The balancing of training and test sets compared to real-world epidemiology is shown in Table 1. References for the epidemiological data used for the calculations are listed in Additional file 1: Table S3.

Types of algorithms

The types of machine learning algorithms used on brain CT images were (in descending order of frequency): convolutional neural networks (CNN) (n = 60), random forests (RF) (n = 8), dictionary learning (DL) (n = 6), support vector machines (SVM) (n = 6) and others (n = 9) (see Table 2).

Table 2 Frequency of algorithms used by authors

Hyperparameters

As defined by J. Mongan et al. in the CLAIM protocol [7], research papers on CNNs should at least report the learning rate, the optimization method and the minibatch size used for training the algorithms. Furthermore, the dropout rate as well as the number of epochs should be reported, if any were used. Hyperparameters are also considered reproducible if open-source code is provided. We found this minimum set of hyperparameters defined in only eighteen of the publications using CNNs (31.0%, n = 18). Authors described dropout rates for seventeen models (29.3%, n = 17) and defined the number of epochs in thirty-four publications (58.6%, n = 34).
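As a minimal sketch of what such reporting makes reproducible, assuming the keras R interface and simulated placeholder data (the architecture and values are illustrative and not drawn from any reviewed publication), each CLAIM-required hyperparameter appears explicitly in the code:

```r
# Sketch of the minimum hyperparameter reporting required by CLAIM, assuming
# the keras R interface; the architecture, values, and simulated data are
# illustrative and not drawn from any reviewed publication.
library(keras)

x_train <- array(runif(20 * 64 * 64), dim = c(20, 64, 64, 1))  # placeholder scans
y_train <- sample(0:1, 20, replace = TRUE)                     # placeholder labels

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(64, 64, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dropout(rate = 0.5) %>%                       # dropout rate
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 1e-4),   # optimization method and learning rate
  loss = "binary_crossentropy"
)

model %>% fit(
  x_train, y_train,
  batch_size = 8,                                     # minibatch size
  epochs = 10,                                        # number of epochs
  validation_split = 0.2
)
```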

The loss function was defined in only thirty-four cases (41.0%, n = 34). Among the defined functions, nine used cross-entropy loss (10.8%, n = 9), five Dice loss (6.0%, n = 5), three mean absolute error loss (3.6%, n = 3), and two Euclidean loss (2.4%, n = 2). Other researchers made use of different loss functions (see Table 3).
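For orientation, the two most frequent loss functions can be sketched as plain R functions on binary labels and predicted probabilities; framework implementations operate on tensors, but the formulas are identical:

```r
# Plain-R sketch of the two most frequent loss functions encountered in the
# reviewed papers, written on binary labels and predicted probabilities.
dice_loss <- function(y_true, y_pred, eps = 1e-7) {
  intersection <- sum(y_true * y_pred)
  1 - (2 * intersection + eps) / (sum(y_true) + sum(y_pred) + eps)
}

binary_cross_entropy_loss <- function(y_true, y_pred, eps = 1e-7) {
  p <- pmin(pmax(y_pred, eps), 1 - eps)
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

# Example on a toy segmentation mask and predicted probabilities
dice_loss(c(1, 1, 0, 0), c(0.9, 0.8, 0.2, 0.1))
binary_cross_entropy_loss(c(1, 1, 0, 0), c(0.9, 0.8, 0.2, 0.1))
```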

Table 3 Frequency of loss functions used by authors

Types of networks

Pretrained networks or basic frameworks were not used or not defined in twenty-six instances (31.3%, n = 26). In cases where the authors defined or used already existing networks, the most frequent were: U-Net (20.5%, n = 17), ResNet (10.8%, n = 9), VGGNet (4.8%, n = 4), PItcHPERFeCT (3.6%, n = 3), and GoogLeNet (3.6%, n = 3). Ground truths were most frequently reports or decisions of radiologists (47.0%, n = 39), followed by raw images without any human interaction (12.1%, n = 10). However, a substantial portion of the authors did not define their model’s ground truth at all, or only to an unsatisfactory extent (37.4%, n = 31).

Illustration of model architectures

Sixty articles provided graphical illustrations of the proposed model architecture (72.3%, n = 60). The purpose of these illustrations is mostly to give readers an overview of the machine learning model. In compliance with CLAIM, the minimum details provided by these illustrations should be the size and vector of the input data and a precise definition of the output layer. The intermediate layers may contain pooling, normalization, regularization and activation functions (again with their vectors) and should show their interrelations. We propose a template of these components in Additional file 1.
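Where no graphical illustration is provided, the same layer-by-layer information can at least be reported in text form; the following sketch with a purely illustrative network, again assuming the keras R interface, prints each layer with its output shape and parameter count:

```r
# Purely illustrative network, assuming the keras R interface: summary() prints
# one row per layer with its type, output shape and parameter count, covering
# the input/output dimensions and intermediate layers an architecture
# illustration should convey.
library(keras)

arch <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(64, 64, 1)) %>%       # input size and vector
  layer_batch_normalization() %>%                     # normalization
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%       # pooling
  layer_flatten() %>%
  layer_dropout(rate = 0.25) %>%                      # regularization
  layer_dense(units = 1, activation = "sigmoid")      # output layer

summary(arch)
```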

Training and validation

The authors defined the ratio between training and validation data sets in more than two-thirds of instances (71.1%, n = 59). Additionally, more than half (56.6%, n = 47) validated their predictions with a separate test set that was excluded from the training and validation data. Only five authors with separate test sets used a truly external test set from a different data source (6.0%, n = 5).
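A minimal sketch of such a split, with simulated patient identifiers and an assumed 80/20 training/validation ratio plus a held-out test set, could look as follows:

```r
# Minimal sketch with simulated patient identifiers: a test set is held out
# first and never enters training or validation; the remainder is split 80/20
# into training and validation sets. The 80/20 ratio is an assumed example.
set.seed(42)

patient_ids <- sprintf("case_%03d", 1:100)

test_ids  <- sample(patient_ids, 20)          # separate test set
remaining <- setdiff(patient_ids, test_ids)

train_ids <- sample(remaining, round(0.8 * length(remaining)))
val_ids   <- setdiff(remaining, train_ids)

length(train_ids); length(val_ids); length(test_ids)
```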

Metrics for model evaluation

Researchers measured a model’s sensitivity (i.e., recall) and specificity mainly in detection models (33.3%, n = 9; 29.6%, n = 8). The area under the receiver operating characteristic curve (AUROC) was predominantly used for prediction models (26.3%, n = 10), and the Dice score (i.e., F1 score) mainly in segmentation applications (29.0%, n = 11). For models on the generation of virtual scans, the mean absolute error (MAE) (22.9%, n = 5) and the structural similarity index (SSIM) (19.4%, n = 6) were primarily used as evaluation metrics. Models on the fusion of imaging modalities were examined based on their peak signal-to-noise ratio (PSNR) (42.9%, n = 3). Additional file 1: Table S4 shows all the results in detail.
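As an orientation for how these metrics relate to each other, the following sketch computes sensitivity, specificity, the Dice/F1 score and the AUROC on simulated binary labels and prediction scores (the pROC package is assumed for the AUROC):

```r
# Sketch of the main evaluation metrics on simulated binary labels and
# prediction scores; the pROC package is assumed for the AUROC.
library(pROC)

set.seed(1)
labels <- sample(0:1, 200, replace = TRUE, prob = c(0.78, 0.22))
scores <- ifelse(labels == 1, rnorm(200, 0.7, 0.2), rnorm(200, 0.3, 0.2))
pred   <- as.integer(scores > 0.5)

tp <- sum(pred == 1 & labels == 1); fn <- sum(pred == 0 & labels == 1)
tn <- sum(pred == 0 & labels == 0); fp <- sum(pred == 1 & labels == 0)

sensitivity <- tp / (tp + fn)                 # equals recall
specificity <- tn / (tn + fp)
dice        <- 2 * tp / (2 * tp + fp + fn)    # equals the F1 score on binary labels
auroc       <- as.numeric(auc(roc(labels, scores)))

c(sensitivity = sensitivity, specificity = specificity, dice = dice, auroc = auroc)
```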

Comparison of algorithm performances

Twenty-eight authors published their machine learning solutions only as proofs of concept (33.7%, n = 28). Some papers compared their models’ performance with radiologists (16.7%, n = 12) or with other algorithms (18.1%, n = 18). In all other articles, the models were compared with current non-machine learning approaches.
