The development of artificial intelligence (AI) algorithms in healthcare has gained significant momentum in recent years, including in radiology. In early 2023, there were more than 200 commercially available AI software solutions for radiology alone [1]. One important aspect of external validation of AI models is the creation of benchmark datasets. A benchmark dataset is a well-curated collection of expert-labeled data that represents the entire spectrum of diseases of interest and reflects the diversity of the targeted population and the variation in data collection systems and methods. Such datasets are vital for validating AI models, in the sense of establishing their reliability and accuracy, thereby increasing trustworthiness and the likelihood of robust performance in real-world applications [2,3,4].
If the dataset used to develop and validate an AI algorithm is not representative of the target population, biases could arise that could have severe consequences for a large group of patients [3]. For instance, if a dataset is derived from a relatively homogenous source population within a well-established healthcare system, the developed algorithms may not generalize effectively to, for example, a limited-resource setting with different demographic and pathophysiological features of the population. This may further amplify health inequities, potentially leading to worse healthcare outcomes for those marginalized populations [5]. Also, algorithms developed on over-used public datasets derived from a hospital population may exhibit subpar performance if applied in a screening setting on individuals with similar demographics but different disease prevalence [6, 7]. This could lead to missed diagnoses on a large scale, especially in light of automation bias [8]. Logullo et al [9] reviewed studies in which AI was trained to diagnose lung nodules (detect, segment, or classify them) using public datasets. They showed that 49% of the included studies used LIDC-IDRI [10], LUNA [11], or a dataset derived from them during model development and/or validation. The characteristics of such public datasets might differ from those of the intended use case of an AI algorithm that used them for training/validation. For example, the volume quantification of nodules might have been derived from manual diameter measurements, which give different results than fully automated measurements. In addition, these public datasets might have been preprocessed, and their quality might differ from that of images used in clinical practice. It is therefore essential to perform further analysis to ensure the clinical utility of a dataset before deciding whether it should be used for the particular task of interest.
It is imperative to create and enable access to benchmark datasets encompassing diverse populations and disease characteristics to validate the performance of an AI algorithm and test its generalizability. Moreover, the benchmark creation process must be transparent and rigorously documented. Furthermore, the dataset should be representative of the clinical context it is designed to address (e.g., screening or clinical diagnosis). Consequently, creating a benchmark dataset is not a straightforward task, as biases could arise at various steps of its formation process [3]. Factors to consider in order to limit bias include the data sources used, anonymization steps, data formats, and annotation methods.
There are initiatives to standardize infrastructure for validating AI software in imaging, enhancing transparency [1]. Furthermore, recommendations exist for benchmark datasets for medical imaging in screening settings, but no standardized approach exists for clinical applications [12]. In pathology, proposals for creating test datasets to validate AI performance are already in place [13]. For more general AI solutions, it might be argued that local fine-tuning of a model and strict post-market surveillance is most efficient since data are scarce. However, a model’s weaknesses need to be established before it is introduced in the clinic, especially for rare diseases.
This paper explores the key considerations in creating imaging benchmark datasets (Fig. 1) to validate the performance of AI software, addressing challenges like data quality and data heterogeneity, and emphasizing domain experts’ input. Finally, it discusses metrics for evaluating model performance and provides recommendations for creating benchmark datasets in clinical practice. The primary objective of this paper is to guide the development of these datasets for AI software assessment in hospitals.
Fig. 1 Considerations for the creation of a benchmark dataset
Imaging benchmark dataset creation
When developing a benchmark dataset, there are several steps to be taken [4, 12]. The following section highlights and examines the most crucial of these steps.
Identification of specific use case
It is essential to identify the specific use case(s) prior to creating a benchmark dataset. This involves considering various tasks such as object detection, binary or multiclass classification, segmentation, and regression, along with their requirements (e.g., correct bounding boxes for detection, correct contours for segmentation). The clinical context, including the disease(s) of interest, modality, target population, and healthcare setting, should be clear, such as detecting chest X-ray abnormalities vs a normal chest X-ray in patients presenting to the emergency department of a secondary or tertiary referral center. Furthermore, it is important to identify the most accurate ground truth labels. In many cases, the expert reader is regarded as the ground truth, although this is based more on practical grounds than on actual proof. Follow-up of patients or more extensive diagnostics are often lacking, resulting in the absence of a definitive ground truth. For example, biopsy results should be preferred over clinical observations to decide whether a lung nodule is malignant, but they are either not available at all (yet) or simply not included in the data collection. Furthermore, in this case, the required 2-year follow-up data that could be used to confirm the benign nature of nodules are also often lacking.
Representativeness of cases
A crucial aspect to consider is the representativeness of cases encountered in clinical practice. The dataset must reflect real-world scenarios, including the disease severity spectrum, and ensure diversity in terms of demographics, vendors, and other factors.
One challenge that is difficult to address is the inclusion of rare diseases. Given their low prevalence, a very large sample size would be needed for these cases to be properly represented. Since it is commonly unfeasible to acquire a sufficiently large dataset, one proposed method is to augment the dataset by generating synthetic data, including variants of the underrepresented subsets [14]. For segmentation tasks, the inclusion of synthetic cases has been shown to improve the intersection over union (IoU) by up to 30% [15]. For detection tasks, such as identifying the chromophobe subtype, synthetic histology images have improved accuracy in clinical settings [16]. However, potential biases introduced by synthetic dataset heterogeneity in clinical practice are still under investigation [17].
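To make the reported gains concrete, IoU quantifies the spatial overlap between a predicted and a reference segmentation. The following minimal sketch (plain NumPy; the function name and toy masks are illustrative) shows how it is typically computed for binary masks:

```python
import numpy as np

def intersection_over_union(pred: np.ndarray, truth: np.ndarray) -> float:
    """Compute IoU (Jaccard index) for two binary segmentation masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        # Both masks empty: treat agreement on "no lesion" as perfect overlap
        return 1.0
    return intersection / union

# Example: compare a model prediction against an expert reference mask
reference = np.zeros((64, 64), dtype=bool)
reference[20:40, 20:40] = True
prediction = np.zeros((64, 64), dtype=bool)
prediction[25:45, 25:45] = True
print(f"IoU: {intersection_over_union(prediction, reference):.2f}")
```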
Considering all the above-mentioned factors (spectrum of disease, diverse demographics, etc.) will help guarantee that the dataset is representative of the patient population and the intended clinical setting (e.g., primary care, public hospital, academic centers, or population screening).
For instance, a dataset derived from a population-based screening cohort is unsuitable for validating algorithms intended for routine computed tomography (CT) scans of a hospital population due to differences in scan protocols and disease prevalence. Validating algorithms is challenging due to the heterogeneity of clinical indications and incidental findings leading to new diagnoses, especially in broader clinical settings such as abdominal CT. In these cases, there may be patients with varying indications, ranging from analysis of an incidental finding to periodic oncologic follow-up [18, 19]. This is why it may be more straightforward to implement or evaluate AI techniques in highly specialized environments characterized by well-defined indications and a limited spectrum of findings, such as mammography screening [20, 21] and prostate cancer detection on MRI [22].
An example of a non-representative dataset in terms of population characteristics is the MIMIC-CXR dataset [23, 24], which consists mainly of data from a single hospital’s emergency department [6]. MIMIC-CXR is a large-scale dataset of chest X-ray images with associated radiology reports. For lung nodule detection on chest CT, well-known datasets are LIDC-IDRI [10] and the derived LUNA16 [11]. Their popularity among researchers stems from their being the only publicly available datasets providing lung nodule coordinates. However, AI solutions based on these datasets may have limited generalizability. A study by Li et al [25] showed that algorithms trained on independent datasets and LUNA16 maintained high performance when tested on a non-LIDC-IDRI dataset. In contrast to that study, Ahluwalia et al [6] showed that when chest radiograph classifiers are validated on a geographically and temporally different real-world dataset, their diagnostic performance may drop in certain subgroups. Thus, caution must be exercised when applying a solution developed based only on, for example, the public LUNA16 dataset to real-world scenarios.
Proper labeling
The main characteristic of a well-curated benchmark dataset is that it should be properly labeled to be used as a reference standard for validation studies, ideally through sufficiently long follow-up or pathological proof (biopsy and/or histology). Often, reader consensus or majority voting is taken as a proxy, since histology or cross-sectional imaging of all participants is usually not available in a retrospective study design, nor ethically feasible in a prospective setting. This (inherently imperfect) approach requires the involvement of domain experts, including radiologists. The years of experience of these experts should be considered and reported, and cases with poor interobserver agreement should be identified and analyzed for any (systematic) errors. Another consideration related to the labeling process is the type of labels, which should be accompanied by proper instructions, especially when labels are collected from different hospitals, to ensure homogeneous results. It is also crucial to decide on the annotation format, such as DICOM (DICOM-SEG, RTSTRUCT), NIfTI, or BIDS [26]. For ultrasound images, any image annotation format that preserves either the grayscale image or the RGB colors is sufficient [27]. Diaz et al [26] provided a comprehensive guide to (open-access) data curation tools, and Willemink et al [28] presented a list of steps for preprocessing medical imaging data and explained the difficulties in data curation and data availability.
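As a practical illustration of basic annotation curation, the following sketch (using the nibabel library; the file names and the specific checks are hypothetical choices, not prescribed steps) verifies that a NIfTI segmentation matches its source image and lists the label values it contains:

```python
import nibabel as nib
import numpy as np

# Hypothetical paths; in practice these would come from the curated dataset manifest
image_path = "case_001_ct.nii.gz"
mask_path = "case_001_seg.nii.gz"

image = nib.load(image_path)
mask = nib.load(mask_path)

# Basic consistency checks before accepting an annotation into the benchmark
assert image.shape == mask.shape, "Segmentation and image dimensions differ"
assert np.allclose(image.affine, mask.affine), "Image and mask are not in the same space"

labels = np.unique(np.asarray(mask.dataobj))
print(f"Voxel spacing: {image.header.get_zooms()}")
print(f"Label values present in mask: {labels}")
```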
Another important consideration is the types of metadata that should be included along with the annotations. Metadata can include information such as de-identified patient demographics, relevant clinical history, etc., which can help contextualize the labeled data and provide useful information for downstream analysis. The inclusion and analysis of metadata should be done with caution, since there might be correlations between metadata of different formats [29]. In addition, metadata should reflect the information available to an AI model in clinical practice if it is to be used directly for inference on clinical cases [30]. Lastly, it is possible to include in the metadata (as in DICOM-SEG) information on whether the labels were obtained manually, semi-automatically, or fully automatically using an AI algorithm, in a way that preserves annotator anonymity, and to allow the evaluation of inter- and intra-observer variability. For cases with multiple binary segmentations (e.g., one from each radiologist), approaches to select the input mask for an AI algorithm include taking the intersection of the masks, their mean, their union, or randomly selecting one of them. It is also possible to perform a majority vote on a pixel basis [31]. The above methods are two-stage approaches in which curated labels are created based on the available ones [32]. There is also a need for specific recommendations on how to deal with regions where radiologists are uncertain whether they belong to a tumor [33].
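A minimal sketch of such mask-fusion strategies is given below; it assumes binary NumPy masks of identical shape, and the function name and method labels are illustrative rather than taken from any specific toolkit:

```python
import numpy as np

def fuse_masks(masks: list[np.ndarray], method: str = "majority") -> np.ndarray:
    """Combine multiple binary segmentations of the same case into one reference mask."""
    stack = np.stack([m.astype(bool) for m in masks], axis=0)
    if method == "intersection":   # voxel kept only if all readers agree
        return np.all(stack, axis=0)
    if method == "union":          # voxel kept if any reader marked it
        return np.any(stack, axis=0)
    if method == "majority":       # voxel kept if more than half of the readers marked it
        return stack.mean(axis=0) > 0.5
    raise ValueError(f"Unknown fusion method: {method}")

# Example with three readers annotating the same 2-D slice
reader_masks = [np.random.rand(128, 128) > 0.5 for _ in range(3)]
consensus = fuse_masks(reader_masks, method="majority")
```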
Of equal importance to the type and format of metadata is the issue of data harmonization. Data collected from multiple centers are needed to enhance stability and robustness, but they exhibit variations in clinical and/or imaging characteristics obtained from diverse scanners and protocols [34]. Common harmonization techniques for tabular data include standard scaling and ComBat [35], whereas histogram equalization, adaptive histogram equalization, and contrast-limited adaptive histogram equalization are commonly used to harmonize medical images [36]. There are still open research questions regarding the limited reproducibility of harmonization methods, especially when the variations are related to radiomic features [37]. For example, ComBat harmonization is a statistical method developed to remove batch effects in microarray expression data. However, unlike the gene expression arrays for which ComBat was designed, radiomic features have different complexity levels, which are expected to be non-uniformly affected by variations in imaging parameters [38]. Furthermore, ComBat harmonization aims only to remove the variance attributed to batch effects while maintaining the biological information; using ComBat to correct these effects directly on patient data without providing the correct biological covariates that actually affect radiomic feature values will lead to a loss of biological signal, because ComBat will assume that the variations in radiomic feature values are attributable only to the defined batch and thus will not perform uniformly [39]. For the above reasons, ComBat corrections cannot simply be applied during inference; rather, the training and test data must be processed together, which changes the feature values as well. Even in the case of a single additional participant, the entire harmonization process must be repeated from scratch, and the model would have to be retrained as well. Therefore, the ComBat method cannot deal with prospective data, since its performance depends on variations between batches, making it impractical and not readily applicable in clinical settings [39,40,41,42]. Currently, European Horizon 2020 projects are working on data harmonization methods [43]. One of them is the ChAImeleon project [44], which recently announced a challenge in which harmonized multimodality imaging and clinical data will be provided for many types of cancer, allowing the development and comparison of algorithms.
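For image-level harmonization, a minimal sketch of contrast-limited adaptive histogram equalization (CLAHE) applied to a CT slice is shown below, using scikit-image; the intensity window, clip limit, and synthetic input are illustrative assumptions rather than recommended settings:

```python
import numpy as np
from skimage import exposure

def harmonize_ct_slice(hu_slice: np.ndarray,
                       window: tuple[float, float] = (-1000.0, 400.0)) -> np.ndarray:
    """Apply a fixed intensity window followed by CLAHE to a CT slice in Hounsfield units."""
    lo, hi = window
    # Clip to a common window and rescale to [0, 1] so slices from different
    # scanners/protocols share the same intensity range before equalization
    clipped = np.clip(hu_slice, lo, hi)
    rescaled = (clipped - lo) / (hi - lo)
    # Contrast-limited adaptive histogram equalization (CLAHE)
    return exposure.equalize_adapthist(rescaled, clip_limit=0.03)

# Example on a synthetic slice standing in for real data
fake_slice = np.random.normal(loc=-300, scale=400, size=(512, 512))
harmonized = harmonize_ct_slice(fake_slice)
```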
Sample size and bias considerations
A benchmark dataset should be appropriately sized for the task at hand and should consider the clinically relevant difference in effect size and the desired level of statistical significance to be achieved. Preferably, sample size calculations are performed, although no standardized method is available for modern AI tools to date due to their complexity [45, 46]. Generic sample size calculations can be performed in cases where areas under the curve (AUCs) are calculated, providing a minimum sample size for a given AUC, confidence interval, and confidence level [47]. A review performed by Balki et al [46] showed that evidence of sample size calculations is scarce in AI applications in medical research; only a few studies performed any kind of sample size analysis. Rajput et al [48] showed that for the sample size to be considered adequate, the classification accuracy of a model should be above 80% and the effect size should be larger than 0.5 on Cohen’s scale. For sample size calculation of the validation dataset, Goldenholz et al [45] developed a model-agnostic open-source tool that can be used for any clinical validation study. Balki et al showed that both pre- and post-hoc methods have their strengths and weaknesses and advise that researchers should try to perform both to estimate sample sizes, or consult a biostatistician when conceptualizing a study [46]. It should also be noted that the choice of sample size depends on the algorithm that will be used: more complex models (based on deep learning (DL)) usually require more data than classical machine learning algorithms (e.g., decision trees). As traditional sample size estimations cannot draw conclusions regarding the clinical value of a machine learning algorithm due to its complexity, tools such as sample size analysis for machine learning can be useful [45]. With such a tool, by specifying the performance metrics to calculate and other parameters such as the required precision, accuracy, and the ‘coverage probability’, an estimate of the minimum sample size required to achieve metric values above a cut-off can be obtained. For machine learning solutions other than predictive models, there is still no consensus on sample size, but the more variables and the rarer the outcome, the larger the sample size needed.
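As a generic illustration of an AUC-based calculation (not the method of the tools cited above), the sketch below uses the Hanley & McNeil standard-error formula to find the smallest equal number of positive and negative cases for which the 95% confidence interval around an expected AUC stays within a chosen half-width; the expected AUC of 0.85 and half-width of 0.05 are assumed example values:

```python
import math

def auc_standard_error(auc: float, n_pos: int, n_neg: int) -> float:
    """Hanley & McNeil (1982) standard error of the AUC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(var)

def min_cases_per_class(expected_auc: float, ci_half_width: float, z: float = 1.96) -> int:
    """Smallest equal number of positive and negative cases whose 95% CI
    half-width around the expected AUC does not exceed the requested value."""
    n = 2
    while z * auc_standard_error(expected_auc, n, n) > ci_half_width:
        n += 1
    return n

# Example: expected AUC of 0.85, desired 95% CI of +/- 0.05
print(min_cases_per_class(0.85, 0.05))
```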
The availability of resources to collect the data, and the rarity of the diseases of interest, may limit the number of cases unless the dataset is augmented. Whether the dataset should maintain the natural disease prevalence or contain equal numbers of normal and diseased cases depends on its intended use. If the dataset will be used to validate the real-world applicability of an AI algorithm, the natural disease prevalence of the target population should be maintained. If its purpose is to train machine learning algorithms, a balanced dataset is preferable, since otherwise a very large sample size is needed to obtain optimal performance. Furthermore, increasing the sample size does not guarantee a more accurate AI algorithm, as demonstrated in the case of distinguishing various clinical conditions that could indicate the presence of prodromal Alzheimer’s disease [49]. Efforts should be made to keep the risk of bias low by considering possible sources of bias during dataset creation [30].
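The distinction between a balanced training subset and a prevalence-preserving benchmark can be made explicit in code. The following sketch (NumPy only, with an illustrative function name and an assumed 5% prevalence) undersamples the majority class for training while the full, prevalence-matched set would be reserved for validation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def balanced_subsample(labels: np.ndarray) -> np.ndarray:
    """Return indices of an undersampled, class-balanced subset (for training only);
    a validation/benchmark set should instead keep the natural prevalence."""
    classes, counts = np.unique(labels, return_counts=True)
    n_per_class = counts.min()
    chosen = [rng.choice(np.where(labels == c)[0], size=n_per_class, replace=False)
              for c in classes]
    return np.sort(np.concatenate(chosen))

# Example: 5% disease prevalence in the source population
labels = (rng.random(10_000) < 0.05).astype(int)
train_idx = balanced_subsample(labels)
print(f"Training subset prevalence: {labels[train_idx].mean():.2f}")  # approx. 0.50
```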
One dataset that is frequently used in the literature [50, 51] for external validation of AI tools is the MIMIC-CXR dataset [23, 24]. Caution is warranted because it consists of single-center data and might not be representative of geographically different populations. A study by Ahluwalia et al [6] showed that if a subgroup analysis is performed, the performance of chest radiograph classifiers is dependent on patient characteristics, clinical setting, and pathology. Still, the creation of such large databases can facilitate progress in creating AI solutions that could potentially be implemented in clinical practice and should be promoted, especially given that they are still largely lacking for other imaging modalities such as CT, MRI, and PET/CT. Many other forms of bias can arise during the data collection and annotation phases; a detailed overview is provided in a recent review [3].
Image quality and de-identification
When creating a benchmark dataset in radiology, image quality is crucial. Images must be free of artifacts that render them nondiagnostic and should be correctly preprocessed [28]. Furthermore, to ensure reproducibility, any preprocessing of the images (e.g., noise reduction, intensity normalization, or augmentation) should be thoroughly described, and the software (code) used should be made available to the researchers who will perform the validation. Images in a benchmark dataset should be acquired using appropriate acquisition settings and parameters, similar to those of the intended use. Be aware that images from older scanners in open datasets might differ from current clinical practice, making them unsuitable for benchmarks. Detecting the performance drift of an algorithm trained with such images can be done with different methods, such as simply using the scan date to exclude them or applying unsupervised prediction alignment [52] to correct for the drift. Other methods include checking the metadata for parameters that indicate the scanner’s year, or assessing the image quality of the scans to confirm that the resolution is not low, that there are no signs of degradation, and that noise levels are normal. Apart from the above, data drift can also be caused by changes in the clinical population (demographic or disease prevalence changes) and/or changes in clinical guidelines, diagnostic criteria, and treatment protocols used in clinical practice. Therefore, these factors should always be assessed to evaluate whether data drift has occurred.
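A simple metadata screen of this kind can be scripted. The sketch below (using pydicom, with a hypothetical cut-off year and an illustrative, non-exhaustive set of checks) flags studies whose acquisition details suggest they may not reflect current practice:

```python
import pydicom

def scan_metadata_flags(dicom_path: str, min_year: int = 2015) -> dict:
    """Flag a study whose acquisition metadata suggests it may not reflect
    current clinical practice (hypothetical cut-off year and checks)."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    study_year = int(str(ds.get("StudyDate", "00000000"))[:4] or 0)
    return {
        "study_year": study_year,
        "too_old": study_year < min_year,
        "manufacturer": str(ds.get("Manufacturer", "unknown")),
        "model": str(ds.get("ManufacturerModelName", "unknown")),
        "slice_thickness_mm": ds.get("SliceThickness", None),
    }
```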
Data privacy and security are legally required in healthcare. Protection of personal data can be achieved through different techniques such as randomization (deletion of identifiers), cryptographic techniques, restricted access, etc. [53], which must also comply with relevant regulations. In the case of a restricted dataset, hosting it using privacy-preserving techniques (e.g., encryption) can ensure the protection of sensitive information.
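By way of illustration, a minimal de-identification pass over a DICOM file can be sketched with pydicom as below; the tag list is deliberately short and hypothetical, and a production pipeline should instead follow the DICOM PS3.15 confidentiality profiles and institutional policy:

```python
import pydicom

# Tags that commonly carry direct identifiers; a real de-identification pipeline
# should follow the DICOM PS3.15 confidentiality profiles rather than this short list
IDENTIFYING_KEYWORDS = [
    "PatientName", "PatientID", "PatientBirthDate", "PatientAddress",
    "ReferringPhysicianName", "InstitutionName", "AccessionNumber",
]

def basic_deidentify(in_path: str, out_path: str) -> None:
    """Minimal illustration: blank direct identifiers and strip private tags."""
    ds = pydicom.dcmread(in_path)
    for keyword in IDENTIFYING_KEYWORDS:
        if keyword in ds:
            ds.data_element(keyword).value = ""
    ds.remove_private_tags()
    ds.save_as(out_path)
```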
In the European Union (EU), privacy and security laws, especially Europe’s General Data Protection Regulation (GDPR), do not allow unrestricted data sharing with other institutions to improve models. Even with de-identified metadata, it has been shown that it is for example still possible to reconstruct the face of the individual who underwent an MRI scan of the head [54]. One pr