RIDGE: Reproducibility, Integrity, Dependability, Generalizability, and Efficiency Assessment of Medical Image Segmentation Models

The Introduction Section

I-1: Background, Purpose, and How the Segmentation Model Will Be Integrated Into Clinical Workflow

The introduction section should include the clinical and scientific background required for understanding the study and its potential applications and impact. The authors should state the clinical question they wish to address and the current standard of care for ROI/VOI segmentation. If applicable, state-of-the-art approaches should be mentioned, along with their drawbacks. Furthermore, the intended use of the developed models or methodologies and how they contribute to the clinical workflow should be explained. Lastly, although not an absolute requirement, discussing regulatory considerations (e.g., FDA clearance in the USA), integration into the clinical workflow, and the automated reporting of key results would enhance the value of a manuscript, when applicable.

I-2: Study Objectives Regarding State-of-the-Art Segmentation Models

Often, the study objective is to address shortcomings of state-of-the-art segmentation approaches. For example, a study objective could be to develop a segmentation model that significantly outperforms state-of-the-art models so that the resulting model can add value in a clinical setting. Explicitly stating the study objective(s) enables readers to better understand a study’s contribution. It also sets expectations for reviewers and facilitates manuscript assessment.

The Materials and Methods

M-1: Prospective or Retrospective Study

It should be indicated whether a study has been conducted prospectively or retrospectively. Also, some segmentation studies have both retrospective and prospective components. For example, a segmentation model can be developed and evaluated retrospectively and then further evaluated prospectively. Such cases should be appropriately documented.

M-2: Objectives for Segmentation Models: Development, Exploration, Feasibility, or Comparison Studies

Whether the focus is on model development, an exploratory study, a feasibility assessment, or a comparison such as a noninferiority trial, clear definitions and distinctions are essential. Setting a transparent goal helps readers and reviewers evaluate the technical and clinical contributions of the study and sets the context for the research methodology, result interpretation, and understanding of potential implications.

M-3: Data Sources, Including Imaging Modality, Treatment Received, and Protocol for Image Acquisition

Comprehensive details about the imaging modality (e.g., MRI and CT) and the corresponding imaging protocol are essential for the reproducibility of a study. It is important to specify if multiple scanners and different acquisition protocols were used. It is also crucial to specify if the experiments utilized pre-treatment images, post-treatment images, or both and to articulate the rationale behind such choices. This information facilitates readers’ and reviewers’ understanding of the utility and clinical relevance of the proposed approach and ensures a fair comparison with state-of-the-art methods.

M-4: Detailed Information Regarding the Sample Size Used in the Study

The sample size should be explicitly mentioned. Additionally, the composition of samples across known subpopulations or subcategories of interest should be detailed. It is essential to report the sample size at the patient or participant level if multiple data points per individual are used. If relevant, the process behind its selection should be mentioned. Additionally, the sizes or proportions of data used for training, validation, and test sets should be reported.

M-5: Eligibility Criteria

A detailed description of the process for selecting eligible participants should be provided. It is important to state where and when the potentially eligible patients have been identified. It is recommended to present this information through a flow diagram to enhance clarity. This diagram should illustrate the sequential application of each criterion and indicate the exact number of participants remaining after each step. Any potential biases introduced by the selection process and measures taken to address them should also be highlighted. Transparency in reporting key characteristics of the population of interest is not only important for generalizability but also can inform the target population for a given algorithm. A common example is providing information on whether a study included pediatric patients or not, which in turn would inform whether a tool should be used only in adults or whether it can also be used in the pediatric population.

M-6: Detailed Description of Ground Truth Standards to Allow Replication of Image Annotations

It is essential to provide a detailed reference standard that allows medical experts to annotate images without ambiguity. In situations where segmentation boundaries might be subjective and open to varied interpretations by different experts, a comprehensive description should be provided to minimize interobserver variability. When multiple experts are involved in the annotation process, the methodology employed to reconcile discrepancies and arrive at a consensus ground truth should be explained. However, for certain segmentation tasks, some degree of variation is unavoidable. In these cases, the ground truth segmentation should be performed or edited by subspecialty-trained/certified, experienced professionals. Although there is no absolute rule, segmentations from at least three independent experts are typically expected, and some studies in the literature use more than ten expert segmentations for the same patient. It may be acceptable to use one segmentation per patient (performed by either one or multiple experts) for training the algorithm, but the testing or evaluation should ideally be done by comparing against multiple expert segmentations as described earlier. If multiple segmentations are used to train an algorithm, then the process (e.g., use of a majority-vote-generated ROI/VOI) should be clearly reported. For testing or evaluating the performance of a segmentation algorithm, another approach is to report and compare the performance of the algorithm against the range of ROIs/VOIs (and the consequent variation) of multiple experts.

M-7: Justification of Reference Standards for Ground Truth Image Annotations

In case boundaries of VOIs or ROIs are ambiguous, and several choices can be made for specifying them, the rationale for the choice made in the study should be provided. It is imperative to outline any potential impact the chosen reference might have on the segmentation outcomes or the study conclusions.

M-8: Source of Ground Truth Image Annotations; Qualifications and Training Process for Annotators to Generate Accurate Annotations

The authors should describe the qualifications of the annotators. Also, any training or preparation provided for the annotators before contouring the images should be described. When multiple annotators contour an image, the method for handling discrepancies between these annotations should be described. Also, it is important to state if the contours have been provided manually or in a semi-automated manner, where an algorithm is used to create rough contours that are then manually edited. Lastly, as mentioned earlier, some degree of variation is unavoidable in medical segmentation tasks, and in these cases, the use of multiple expert contours is optimal to ensure reliability and generalizability.

M-9: Tools Used for Image Annotation

The information about the tool(s) used for image annotation should be provided. This includes the name and the version of the software and the underlying operating system. If the annotation software provides multiple contouring tools, it is suggested that the specific ones used for image annotation be listed. When a semiautomated approach is adopted for contouring, details of the method for automatically generating the initial contours, as well as the subsequent refining process, should be described.

M-10: Measuring and Mitigating Interobserver and Intraobserver Variability; Methods for Resolving Annotation Discrepancies

During the data annotation phase, inconsistencies might arise due to multiple observers interpreting samples differently (interobserver variability) or a single observer providing varying annotations for the same sample on different occasions (intraobserver variability). The authors should describe the methods used to quantify interobserver and intraobserver variabilities, possibly through metrics such as Hausdorff distance, Dice coefficient, and Jaccard Index—also known as Intersection over Union (IoU). It is essential to detail any standardized guidelines, training, or protocols provided to the annotators to minimize this variability. Furthermore, the authors should outline the steps or procedures undertaken to resolve discrepancies and ensure annotation consistency.

Model Description

M-11: Detailed Description of Model Architecture, Model Inputs, and Model Outputs

The authors should provide comprehensive details about the model architecture to ensure the model can be reconstructed based on the provided information. The expected input(s) for the model should be explicitly outlined, including image type, size, and preprocessing steps. Similarly, the expected outputs and post-processing steps must be clearly described. If feasible, a link should be provided to a public repository where the code is available.

M-12: Strategy for Initializing Model Parameters

The strategy for model parameter initialization should be described. When a transfer learning approach is employed, it is essential to specify and detail the weights and biases from previously trained models. If pre-trained parameters are used, the authors need to clarify which layers remain open for retraining or weight readjustment tailored to the intended task. If the model is not based on a transfer learning architecture, the method for initializing the model’s parameters should be outlined.
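
As an illustration of how such choices can be reported unambiguously, the following minimal PyTorch sketch freezes an encoder and leaves only the segmentation head open for retraining; the two-stage toy network stands in for the study’s actual architecture and is an assumption of this example, not part of the guideline.

import torch.nn as nn

# Toy encoder-decoder standing in for the actual segmentation architecture.
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU()),  # "encoder"
    nn.Conv2d(8, 2, kernel_size=1),                                       # segmentation head
)

# Transfer-learning-style setup: freeze the encoder, keep the head trainable.
for param in model[0].parameters():
    param.requires_grad = False

# Reporting exactly which parameters remain trainable makes the setup reproducible.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['1.weight', '1.bias']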

Model Training

M-13: Model Hyperparameters and the Methods for Choosing the Model Hyperparameters

The authors should describe the hyperparameters used in model training, including but not limited to learning rate, optimizer, and loss function. If hyperparameters are determined through a trial-and-error process, this procedure should be described, illustrating the range tested and the criteria for final selection. In cases where systematic hyperparameter tuning methods like grid search, random search, or Bayesian optimization are used, details of the search strategy and results should be included.
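
As a minimal sketch of how a systematic search could be documented, the following snippet enumerates a hypothetical grid of hyperparameters; the search space and the placeholder train_and_validate function are illustrative assumptions that would be replaced by the study’s actual ranges and training routine.

import itertools
import random

# Hypothetical search space; the ranges actually tested should be reported.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [4, 8, 16],
    "loss": ["dice", "dice+cross_entropy"],
}

def train_and_validate(config):
    # Placeholder for a real training run; returns a validation Dice score.
    return random.random()

results = []
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    results.append((config, train_and_validate(config)))

best_config, best_score = max(results, key=lambda item: item[1])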

M-14: Image Preprocessing Steps

Image preprocessing is often important in machine learning and deep learning applications. Preprocessing steps, if any, should be described with enough detail to allow for reproducibility of the results. These could include steps such as intensity normalization, image cropping, or resizing.
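
For illustration only, the following sketch shows one possible preprocessing routine (z-score intensity normalization followed by a center crop); the specific steps and parameters are assumptions made for the example, not requirements of the guideline.

import numpy as np

def preprocess(image: np.ndarray, crop_size: int = 128) -> np.ndarray:
    """Example preprocessing: z-score intensity normalization, then a center crop."""
    image = image.astype(np.float32)
    image = (image - image.mean()) / (image.std() + 1e-8)  # intensity normalization
    height, width = image.shape
    top = (height - crop_size) // 2
    left = (width - crop_size) // 2
    return image[top:top + crop_size, left:left + crop_size]  # center crop

processed = preprocess(np.random.rand(256, 256))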

M-15: Image Augmentation

Image augmentation is a common practice in developing deep learning-based segmentation models, especially in the absence of large-scale annotated datasets [13,14,15]. In the context of image segmentation, data augmentation refers to methods that computationally transform an image such that the annotation for the newly generated image can be computationally inferred, alleviating the need for further data collection or manual annotation (see Fig. 1). Image augmentation alleviates overfitting by introducing data variability and artificially increasing the sample size. Often, a stochastic augmentation pipeline is composed, where a sequence of image augmentations, each with a given probability of being applied, is used to create an augmented version of an input image and its corresponding mask.

Fig. 1: Example of image augmentations. The original image (left) has been augmented by zooming (middle) and rotation (right).

To reproduce a data augmentation pipeline, it is essential to provide detailed information about the pipeline. Model-based data augmentation relies on a model to generate synthetic images [15]. When using model-based data augmentations, it is essential to provide information on how to acquire the model and instructions on how to use these models for image augmentation.
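
A minimal sketch of such a stochastic pipeline is shown below; the transforms and probabilities are illustrative assumptions, and the key point is that the mask is transformed identically to the image so the annotation remains valid.

import numpy as np

rng = np.random.default_rng(seed=0)

def augment(image: np.ndarray, mask: np.ndarray):
    """Apply a stochastic sequence of simple geometric augmentations to an image/mask pair."""
    if rng.random() < 0.5:                       # horizontal flip, 50% probability
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:                       # random 90-degree rotation, 50% probability
        k = int(rng.integers(1, 4))
        image, mask = np.rot90(image, k), np.rot90(mask, k)
    return image.copy(), mask.copy()

aug_image, aug_mask = augment(np.random.rand(128, 128),
                              np.zeros((128, 128), dtype=np.uint8))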

M-16: Criteria and Process for Final Model Selection

Model training is an iterative process where a model is updated over multiple epochs. Therefore, it is essential to establish and report criteria for selecting the best model from those developed across different epochs. This selection process can be informed by various performance metrics and stopping criteria. Consequently, reporting the specific performance metrics and the criteria used to halt the training process is crucial. A common approach, for instance, involves monitoring the loss function on the validation and training sets. Decisions can be based on a predetermined threshold for performance improvement on a validation set, a predetermined number of epochs without improvement, or other domain-specific criteria. Understanding these factors offers insights into the model’s reliability and its applicability to real-world scenarios.
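
The following sketch illustrates one such reportable criterion, patience-based early stopping on the validation loss; the listed losses are placeholder values standing in for those produced by an actual training loop.

# Hypothetical per-epoch validation losses from a training run.
val_losses = [0.52, 0.41, 0.37, 0.36, 0.38, 0.37, 0.39]

patience = 3                    # epochs allowed without improvement
best_loss, best_epoch = float("inf"), None
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0
        # In practice, the model checkpoint would be saved here.
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break               # stopping criterion met

print(f"Selected model from epoch {best_epoch} (validation loss {best_loss})")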

M-17: Hyperparameters That Led to the Best Model

Unlike model parameters that are learned based on the training set during the training process, the hyperparameters are often either manually assigned or selected based on some heuristic methods. When using heuristics to choose the model hyperparameters, the validation set is used to select hyperparameters that result in better model performance. Examples of hyperparameters are learning rate, optimization algorithm, momentum, and batch size. The set of model hyperparameters that lead to the best result should be stated in the paper.

M-18: Ensemble Techniques: Model Diversity, Prediction Consolidation, and Computational Considerations (if Applicable)

Due to their high capacity for learning complex problems, deep learning models often exhibit high variance, especially when trained on small datasets. Ensemble approaches combine multiple models to reduce this prediction variance and enhance overall performance [4, 5]. The individual models within an ensemble may vary in their architectures or in the datasets on which they have been trained. When employing ensemble techniques, it is important to outline these differences. Additionally, the method by which predictions from these models are consolidated to form a final prediction should be clearly described. Detailed information about the added computational burden of deploying these models should also be provided, as the computational requirements might render an approach impractical in some clinical settings.
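
As a minimal sketch, assuming three models that each output a foreground probability map for the same image, the snippet below consolidates predictions by averaging (soft voting) and thresholding; majority voting of binary masks would be an equally reportable alternative.

import numpy as np

# Hypothetical foreground probability maps from three ensemble members
# (e.g., models trained on different data folds).
predictions = [np.random.rand(128, 128) for _ in range(3)]

# Soft voting: average the probability maps, then threshold to a binary mask.
mean_probability = np.mean(predictions, axis=0)
final_mask = (mean_probability >= 0.5).astype(np.uint8)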

Model Evaluation

M-19: Metrics for Evaluating Model Performance

Providing correct metrics for model evaluation is essential in developing generalizable models. For example, for problems where the area/volume of interest is a small fraction of the image, pixel/voxel accuracy does not provide a meaningful measure of model performance, as a model that labels every pixel/voxel as not belonging to the area/volume of interest still achieves a high accuracy value. The Dice score and Intersection over Union (IoU) are the most common measures used for model evaluation. To facilitate model comparison, providing these measures for all experiments is recommended. Also, for problems where other metrics, such as distance-based metrics, are commonly used to assess model performance, those should also be reported [6, 7].
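
For reference, a minimal implementation of both overlap metrics on binary masks is sketched below; the small usage example at the end uses arbitrary arrays purely for illustration.

import numpy as np

def dice_and_iou(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8):
    """Dice = 2|A∩B| / (|A| + |B|); IoU = |A∩B| / |A∪B| for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    dice = 2.0 * intersection / (pred.sum() + truth.sum() + eps)
    iou = intersection / (union + eps)
    return dice, iou

dice, iou = dice_and_iou(np.ones((4, 4)), np.eye(4))  # 0.4, 0.25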

M-20: Measuring Robustness or Sensitivity Analysis

Robustness refers to the ability of a model to maintain consistent performance despite minor perturbations or changes in input data. Noise and artifacts are common in medical imaging; therefore, segmentation models should be robust in order to be deployed in a clinical setting. Sensitivity analysis for a given model assesses the extent to which variations in input data affect the predictions made by the model. Given the diversity of human anatomy and the variability in medical imaging modalities, understanding which factors most influence model performance can offer key insights into its potential limitations and areas for improvement. Several best practices are recommended to ensure robust performance in medical imaging models. It is important to use established quantitative metrics such as the Dice coefficient and IoU to measure how model performance changes with variation in an input image. Visual comparisons could provide a qualitative sense of model predictions across different levels of image perturbation and noise. It is essential to report both successful and unsuccessful outcomes, as understanding model limitations is especially critical in clinical contexts. Models should also be tested using images from various sources and patient groups to ascertain widespread usability. Lastly, all assumptions about input data made during analysis should be explicitly documented to highlight their potential impact on the model’s clinical performance.
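
A minimal sketch of such a sensitivity analysis is given below; model_predict is a hypothetical callable standing in for the trained model, and the noise levels are arbitrary example values.

import numpy as np

def sensitivity_to_noise(model_predict, image, truth_mask,
                         noise_levels=(0.0, 0.05, 0.1)):
    """Report how the Dice score changes as Gaussian noise of increasing strength is added."""
    rng = np.random.default_rng(seed=0)
    scores = {}
    for sigma in noise_levels:
        noisy = image + rng.normal(0.0, sigma, size=image.shape)
        pred = model_predict(noisy) > 0
        truth = truth_mask > 0
        dice = 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum() + 1e-8)
        scores[sigma] = dice
    return scores

# Toy demonstration with a thresholding "model"; a real study would pass its trained model.
scores = sensitivity_to_noise(lambda im: im > 0.5,
                              np.random.rand(64, 64),
                              np.random.rand(64, 64) > 0.5)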

M-21: Internal Validation, External Validation, or Both

In internal validation, a subset of the dataset is used for model training and another subset for model evaluation. In contrast, the model is evaluated using an independently derived dataset in external validation. An external dataset often provides a better estimate of model generalizability and should ideally be the primary method for model evaluation whenever possible. It should be explicitly stated whether the model evaluation is internal, external, or both.

M-22: Level at Which Training, Validation, and Test Sets Are Disjoint (e.g., Patient or Institution)

When developing deep learning models, data are often partitioned into training, validation, and test sets. Partitions should ideally be conducted at the patient or institution level to ensure the same subject does not appear in more than one subset. Data partitioning at the institution level can further enhance model generalizability across different setups and data sources. However, when data from different institutions systematically differ, the model performance measures on these sets might substantially vary. In such cases, these differences should be studied and reported.

M-23: Data Points for Each Subject Are Exclusively Present in Training, Validation, or Test Sets

Due to substantial anatomical similarities between different images from the same patient, a model could associate irrelevant anatomical characteristics with an endpoint of interest instead of learning the condition under study. Consequently, the reported performance measures would not reflect the true model performance. In the context of segmentation models, this can lead the model to memorize the segmentation map for a patient based on these unrelated characteristics. Therefore, to avoid such issues, data points from the same patient should be confined to just one of the training, validation, or test sets.
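
A minimal sketch of a patient-level split using scikit-learn’s GroupShuffleSplit is shown below; the patient identifiers are hypothetical, and the same grouping idea extends to institution-level splits.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical dataset: one entry per image, tagged with the patient it came from.
patient_ids = np.array(["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"])
image_indices = np.arange(len(patient_ids))

# Split at the patient level so no patient contributes images to both subsets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(image_indices, groups=patient_ids))

assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])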

M-24: Oversampling Is Not Applied Before Splitting Data into Training, Validation, and Test Sets

Oversampling can contribute to developing segmentation models for imbalanced datasets, particularly for rare pathologies or conditions. However, if oversampling is performed before dataset partitioning, there is a risk that identical images could be distributed across the training, validation, and test sets. This would allow the model to memorize the ROIs/VOIs, resulting in a misleadingly over-optimistic assessment of the segmentation model. Therefore, it is essential to partition the dataset and then apply oversampling on the minority class(es) in the training set. This approach ensures that the model is trained on a more balanced dataset without compromising the integrity of the evaluation process.
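
The order of operations matters, as the minimal sketch below illustrates with hypothetical case indices and labels: the split is performed first, and only the minority class within the training set is then duplicated.

import numpy as np

# Hypothetical case-level labels; class 1 is the rare pathology (~20% of cases).
labels = np.tile([0, 0, 0, 0, 1], 4)
indices = np.arange(len(labels))

# Step 1: split first (in a real study, at the patient level).
train_idx, test_idx = indices[:16], indices[16:]

# Step 2: oversample the minority class only within the training set.
rng = np.random.default_rng(seed=0)
minority = train_idx[labels[train_idx] == 1]
majority = train_idx[labels[train_idx] == 0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
balanced_train_idx = np.concatenate([train_idx, extra])

# The test set is untouched, so no duplicated case can leak into the evaluation.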

M-25: Image Augmentation Is Not Applied Before Splitting Data into Training, Validation, and Test Sets

Data augmentation should only be applied after splitting the dataset into training, validation, and test sets. Although image augmentation changes some characteristics of an image, the augmented image still shares a substantial amount of information with the original image (see Fig. 1). Consequently, a model could achieve high performance measures by memorizing segmentation maps, leading to an overestimation of performance and models that lack generalizability. By performing augmentation only on the training set, the overall performance of the model can be improved without compromising the integrity of the evaluation process.

M-26: Samples in the Test Set Are Not Used to Make a Decision About Preprocessing, Model Training, or Post-processing

Samples in the test set should not be used for selecting preprocessing or post-processing steps or for making decisions during model training. Failing to adhere to this guideline can prevent the test set from providing an unbiased estimate of the model’s generalization error. This oversight might lead to over-optimistic performance measures that do not accurately represent the performance of the model on unseen data.

M-27: Describing Demographic and Clinical Characteristics of Training, Validation, and Test Sets

Demographic and clinical characteristics of samples in the training, validation, and test sets should be described to better evaluate the clinical utility of a proposed model and to enhance the reproducibility of the results. For example, age groups (e.g., pediatric vs. geriatric populations) can have significant differences in anatomy, affecting how ROIs/VOIs appear on a medical image. Also, a trained model might perform differently for patients with different treatment histories or disease subtypes, resulting in substantially different performance measures for different compositions of test sets.

M-28: Strategies to Enhance Segmentation Model Robustness to Common Image Variations

Image variations, which are inherent in medical imaging due to factors like diverse acquisition protocols, hardware differences, and software discrepancies, must be effectively managed by a model to ensure its reliability and deployability in a clinical setting.

To address these variations, techniques such as data augmentation and domain adaptation can be employed. Domain adaptation techniques can help the model generalize across different imaging settings by aligning the feature distributions from different domains, ensuring the model performs consistently well regardless of the source of the images. For example, this could enable models to perform well in the presence of image artifacts, noise, or systematic variations in image acquisitions.

M-29: Software Libraries, Frameworks, and Packages

Libraries, packages, and frameworks used for training and evaluating the model(s) should be described with enough detail to allow for reproducing the results.

M-30: Availability of Trained Model and the Inference Code for Segmenting ROIs/VOIs in an Image Provided in a Standard Format, Except when Restricted by Intellectual Property Considerations

The trained model and inference code should preferably be accessible online, enabling readers and reviewers to assess the performance of the developed model with their own datasets or samples, facilitating comparison with current and future research. This accessibility should include both the preprocessing and post-processing pipelines.

The Results Section

R-1: Estimates of Performance Measures and Their Variability

We recommend providing a comprehensive report on the performance of the proposed medical image segmentation model. This encompasses not just the primary metric for assessing the performance of the segmentation model, such as the Dice score or IoU, but also the confidence intervals that reflect the uncertainty of these measures. Furthermore, given the diversity of medical imaging conditions and modalities, it is essential to highlight potential fluctuations in model performance across known subpopulations or subcategories. Factors to consider include variations in patient demographics, imaging equipment, imaging protocols, and external interferences or noise.
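
One way to obtain such intervals is a nonparametric bootstrap over per-case scores, sketched below with placeholder Dice values; the study’s actual per-case results would be substituted.

import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder per-case Dice scores on the test set.
dice_scores = np.array([0.91, 0.88, 0.79, 0.93, 0.85, 0.74, 0.90, 0.87, 0.82, 0.89])

# Nonparametric bootstrap of the mean Dice with a 95% percentile interval.
bootstrap_means = [
    rng.choice(dice_scores, size=len(dice_scores), replace=True).mean()
    for _ in range(10_000)
]
lower, upper = np.percentile(bootstrap_means, [2.5, 97.5])
print(f"Mean Dice {dice_scores.mean():.3f} (95% CI {lower:.3f}-{upper:.3f})")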

R-2: Failure Analysis of Poorly Segmented Cases

Variations in image quality, human anatomy, and pathology, as well as overlapping structures, often lead to errors in predictions made by segmentation models. Segmentation errors typically manifest as false positive segments, false negative segments, or boundary inaccuracies. For example, when normal tissue is predicted as a tumor, it is a false positive segment; when the model partially or completely misses a tumor, the missing part is a false negative segment.

Often, a single measure is used to describe model performance. However, this approach can prevent a comprehensive understanding of model errors, especially in the presence of systematic errors. For instance, when images feature several ROIs of varying sizes, a model that misses small ROIs but accurately identifies large ones could still achieve a high aggregate score, such as IoU or Dice score. This can be misleading, as the model might be medically unreliable. Analyzing these errors offers insights that can guide model refinement. We recommend visualizing examples where a model fails to perform a medically desirable segmentation. The model errors could be assessed quantitatively or qualitatively.

R-3: A Scatter Plot Representing the Distribution of the Size of Region(s) or Volume(s) of Interest for Training, Validation, and Test Sets

We recommend visually assessing the distribution of sizes of the ROI(s) or VOI(s) across the training, validation, and test datasets. Each ROI or VOI can be represented as a point in this scatter plot. The Y-axis of the plot indicates the size/volume of an ROI/VOI, and different colors can be used to represent samples in training, validation, or test sets. This visualization provides insights into potential biases or imbalances in the size distribution of ROIs or VOIs. A balanced and overlapping distribution across the training, validation, and test sets suggests that the model has been trained and evaluated on a representative sample, minimizing the risk of overfitting to a specific range of ROI/VOI sizes or compromising model generalizability. Moreover, examining the overall size distribution of VOIs or ROIs can highlight the clinical utility of these models. For instance, a model primarily trained to detect large lymph nodes might have limited clinical relevance. A visual exploration of the size of ROIs/VOIs can quickly pinpoint such issues. These scatter plots can also highlight potential biases related to various confounders, such as imaging hardware, software, protocol, patient demographics, or medical conditions. This can be achieved by utilizing different shapes or colors for data points representing samples from each category of potential confounders.

Bland–Altman plots and MA plots can also be used to evaluate discrepancies between model predictions and the ground truth regarding the size of ROIs/VOIs. A Bland–Altman plot, also known as a Tukey mean-difference plot, visualizes the difference between two measurements against their mean. An MA plot is essentially a Bland–Altman plot of log-transformed values, depicting differences against means on a logarithmic scale.
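
A minimal matplotlib sketch of a Bland–Altman plot is given below; the ground-truth and predicted volumes are simulated placeholder data used only to show the construction of the plot (difference against mean, with the bias and 1.96 SD limits of agreement).

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Simulated placeholder volumes (e.g., in mm^3) for ground truth and model predictions.
truth_volumes = rng.uniform(500, 5000, size=40)
predicted_volumes = truth_volumes + rng.normal(0, 150, size=40)

mean_volume = (truth_volumes + predicted_volumes) / 2
difference = predicted_volumes - truth_volumes
bias = difference.mean()
limits = 1.96 * difference.std()

plt.scatter(mean_volume, difference, s=15)
plt.axhline(bias, color="black", label="bias")
plt.axhline(bias + limits, color="gray", linestyle="--", label="1.96 SD")
plt.axhline(bias - limits, color="gray", linestyle="--")
plt.xlabel("Mean of predicted and ground-truth volume (mm^3)")
plt.ylabel("Predicted minus ground-truth volume (mm^3)")
plt.legend()
plt.show()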

R-4: Analyze Bias Across Patient Categories such as Relevant Sociodemographics and Imaging Protocols and Hardware

A model might not achieve the same performance level across all patient population subgroups. These subgroups could be defined based on sociodemographic characteristics such as age or sex, or based on factors such as imaging protocol, imaging hardware, or disease type, to name a few. It is essential to assess the performance of a model across medically relevant groups to avoid bias. By offering a detailed performance breakdown by these categories, we can enhance comprehension of the strengths and weaknesses of a model. Additionally, examining images from diverse categories provides a more comprehensive view of the model’s performance.

R-5: Failure Analysis by Visualizing the Worst-Performing Cases of the Model in the Internal Test Set and, if Applicable, in the External Test Set

A visual review of the most inaccurate predictions from the internal test set and, if applicable, the external test set is highly recommended to rapidly pinpoint areas that need refinement and assist with model improvement. It also uncovers and assists in mitigating inherent biases that the model might have. This rigorous analysis assists readers and reviewers in understanding where a model might falter and informs us about its trustworthiness. Further, by examining potential failures on external datasets, we can verify the ability of a model to generalize across diverse scenarios.

R-6: Performance on External Dataset(s), if Possible, and Explaining Any Statistically Significant Difference Between Performance Measures for Samples in the Internal and External Test Sets

Evaluating the performance of medical image segmentation algorithms on external datasets is crucial to ensure their generalizability across different data sources and conditions. Solely relying on internal datasets may lead to scenarios where a model performs exceptionally well on one specific dataset but fails in real-world scenarios. Statistically assessing any significant discrepancies between the performance of a model on the internal and external test set(s) provides insights into potential biases, limitations, or the robustness of the model, ensuring safer and more reliable clinical application.

The Discussion Section

D-1: Study Limitations, Potential Biases, and Generalization Concerns

Limitations of the study should be clearly detailed in the Discussion section. Potential biases, such as over-representation or under-representation of specific conditions or demographics that might have emerged during data collection, must be emphasized. If there is a lack of external evaluation, or if the external test set may not comprehensively represent the entire population, these issues should be explained. Additionally, any limitations related to study design, data quality, limitations related to the ground truth, or model implementation must be succinctly articulated.

D-2: Practical Utility and Clinical Integration of Segmentation Models

The authors should discuss the practical utility of their model in a medical context. This will help readers and reviewers grasp the significance of the work and understand how it can potentially be integrated into medical practice. Discussion of adequacy or potential challenges to clinical deployment from a regulatory perspective and how the algorithm might be integrated into the workflow for clinical practice implementation would be desirable for a high-impact comprehensive study.

D-3: Highlighting Data Imbalance due to Differences in the Size of ROIs/VOIs and Its Potential Effect on Performance Measures

If there is any data imbalance resulting from varied ROI/VOI sizes, the authors should clearly highlight this. They should also articulate the impact of such imbalance on the performance of the proposed model, including potential improvements or deteriorations if the dataset were balanced. Furthermore, the authors need to discuss the measures they took to address this imbalance, thereby demonstrating the value of their work.

The Conclusion Section

C-1: Concise Presentation

The authors should succinctly describe the contributions of their work in this section. This may include the novelty of the proposed approach, primary contributions, a brief overview of the methodology, the most notable findings of the study, and potential implications or future directions for research.

C-2: Proper Positioning of the Work in the Context of State-of-the-Art Practice, if Applicable

Proper positioning of research within the context of state-of-the-art practices provides readers and reviewers with a clear understanding of how the presented work compares with or advances beyond the current best practices in the field. In instances where particular research does not necessarily surpass the state-of-the-art, it is still important to understand its position in the broader landscape, as this can illuminate complementary aspects, potential synergies, or alternative perspectives.

C-3: Recommendations for Future Work, if Applicable

To guide subsequent studies, it is recommended that the authors detail challenges faced, highlight unaddressed gaps, and suggest potential methods or refinements for the future. This section should also hint at broader areas warranting exploration based on current findings, propose practical real-world applications, and, if applicable, provide an overview of planned subsequent research. By doing so, the paper paves the way for future scholarly endeavors.

C-4: The Conclusion Is Adequately Supported by the Results of the Study

It is essential to ensure that the conclusions of a study are directly derived from, and supported by, the empirical findings presented in the paper. It is imperative that there be no overgeneralization of results. By ensuring that the conclusions are firmly grounded in the actual results, the study preserves its integrity, credibility, and relevance to its audience.

Source Code

S-1: Code Is Made Available, or if Not, Is Justified Within the Manuscript as to Why It Is Omitted

Source code greatly assists readers in comprehending the methodology detailed in the articles. By executing the code in their own environments, readers can conduct experiments with new data using the existing methodology or potentially leverage it to develop new methodologies. This availability of code also facilitates comparisons with future research and the state-of-the-art in the field. Thus, it is recommended that the source code be made accessible and referenced in the article. If, for any reason, the code cannot be made available, authors should articulate the reasons for its omission.
