Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Models (CREMLS)


Introduction

The number of papers presenting machine learning (ML) models that are submitted to and published in the Journal of Medical Internet Research and other JMIR Publications journals has steadily increased over time. The cross-journal JMIR Publications e-collection “Machine Learning” includes nearly 1300 articles as of April 1, 2024 [], and there are additional sections in other journals that collate articles related to the field (eg, “Machine Learning from Dermatological Images” [] in JMIR Dermatology). From 2015 to 2022, the number of published articles with “artificial intelligence” (AI) or “machine learning” in the title and abstract in JMIR Publications journals increased from 22 to 298 (13.5-fold growth), and there were already 312 such articles in 2023 (14-fold growth). For JMIR Medical Informatics, the number of articles increased from 10 to 160 (16-fold growth) over the same period. This is consistent with the growth in the research and application of medical AI in general, where a similar PubMed search (with the keyword “medicine”) revealed a 22-fold growth (from 640 to 14,147 articles) between 2015 and 2022, with 11,272 matching articles already in 2023.

Many papers reporting the use of ML models in medicine have used a large clinical data set to make diagnostic or prognostic predictions [-]. However, the use of data from electronic health records and other resources is not without pitfalls, as these data are typically collected and optimized for other purposes (eg, medical billing) [].

Editors and peer reviewers involved in the review process for such manuscripts often go through multiple review cycles to enhance the quality and completeness of reporting []. The use of reporting guidelines or checklists can help ensure consistency in the quality of submitted (and published) scientific manuscripts and, for example, help authors avoid omitting key information. In the experience of the editors-in-chief of JMIR AI, missing information is especially notable in manuscripts reporting on ML models, where it can lengthen the overall review interval by adding revision cycles.

According to the EQUATOR (Enhancing the Quality and Transparency of Health Research) network, a reporting guideline is “a simple, structured tool for health researchers to use while writing manuscripts. A reporting guideline provides a minimum list of information needed to ensure a manuscript can be, for example: understood by a reader, replicated by a researcher, used by a doctor to make a clinical decision, and included in a systematic review” []. These can be presented in the form of a checklist, flow diagram, or structured text.

In this Editorial, we discuss the general JMIR Publications policy regarding authors’ application of reporting guidelines. We then focus specifically on the reporting of ML studies in JMIR Publications journals.


JMIR Publications Policy on the Use of Reporting Guidelines

Accumulating evidence suggests that authors’ application of reporting guidelines and checklists in health research benefits authors, readers, and the discipline overall, for example by enabling the replication or reproduction of studies. Recent evidence suggests that asking reviewers, instead of authors, to use reporting checklists offers no added benefits regarding reporting quality []. However, Botos [] reported a positive association between reviewer ratings of adherence to reporting guidelines and favorable editorial decisions, while Stevanovic et al [] reported significant positive correlations between adherence to reporting guidelines and citations, and between adherence and publication in higher-impact-factor journals.

JMIR Publications’ editorial policy recommends that authors adhere to applicable study design and reporting guidelines when preparing manuscripts for submission []. Authors should note that most reporting guidelines are strongly recommended, particularly because they can improve the quality, completeness, and organization of the presented work. At this time, JMIR Publications requires reporting checklists to be completed and supplied as multimedia appendices for randomized controlled trials without [-] or those with eHealth or mobile health components [], systematic and scoping literature reviews across the portfolio, and Implementation Reports in JMIR Medical Informatics []. Although some medical journals have mandated the use of certain reporting guidelines and checklists, JMIR Publications recognizes that authors may have concerns about the additional burden that the formalized use of checklists may bring to the submission process. As such, JMIR Publications has chosen to begin recommending the use of ML reporting guidelines and will evaluate their benefits and gather feedback on implementation costs before considering more stringent requirements.


Reporting on ML Models

Regarding the reporting of prognostic and diagnostic ML studies, multiple directly relevant checklists have been developed. Klement and El Emam [] have consolidated these guidelines and checklists into a single set that we refer to as the Consolidated Reporting of Machine Learning Studies (CREMLS) checklist. CREMLS serves as a reporting checklist for journals publishing research describing the development, evaluation, and application of ML models, including all JMIR Publications journals, which have officially adopted these guidelines. CREMLS was developed by identifying existing relevant reporting guidelines and checklists. The initial list of candidate guidelines was identified through a structured literature review and expert curation; the quality of the methods used to develop them was then assessed to narrow them down to a high-quality subset, which was further filtered against specific inclusion and exclusion criteria. The resultant items were converted into guidelines and a checklist that was reviewed by the editorial board of JMIR AI and then applied in a preliminary assessment of articles published in JMIR AI. The final checklist offers present-day best practices for high-quality reporting of studies using ML models.

Examples of the application of the CREMLS checklist are presented in Table 1. To construct the table, we identified 7 articles published in JMIR Publications journals that exemplify the checklist items. Note that not all of the items are relevant to each article, and some articles are particularly good examples of how to operationalize a checklist item. To make a few of the methodological items more concrete, short illustrative code sketches follow the table footnotes.

Table 1. Illustration of how various articles published in JMIR Publications journals implement each of the CREMLS (Consolidated Reporting of Machine Learning Studies) checklist items. Each row lists the item number, the checklist item, and an example illustrating that item.

Study details

1.1 The medical or clinical task of interest: Examines chronic disease management (a clinical problem) with 4 example solutions using ML^a models []
1.2 The research question: Proposes a framework to transfer old knowledge to a new environment to manage drifts []
1.3 Current medical or clinical practice: Provides a review of current practice and issues associated with chronic disease management []
1.4 The known predictors and confounders of what is being predicted or diagnosed: Describes variables defined as part of a well-established health test available to the public []
1.5 The overall study design: Presents the experimental design with data flow and data partitions used at various steps of the experiment []
1.6 The medical institutional settings: Describes the institution as an academic (teaching) community hospital where the data were collected []
1.7 The target patient population: Clear partitioning of the target patient populations and the comparator group []
1.8 The intended use of the ML model: Describes how the prediction model fits in the clinical practice of scheduling operating theater procedures []
1.9 Existing model performance benchmarks for this task: Reviews existing research and presents achieved performance (eg, AUC^b) []
1.10 Ethical and other regulatory approvals obtained: Ethics approvals []

The data

2.1 Inclusion or exclusion criteria for the patient cohort: Defined in the paper by Kendale et al []
2.2 Methods of data collection: Describes sources and methods of data collection, what type of data were used, and potential implied bias in interpretation []
2.3 Bias introduced due to the method of data collection used: Discusses potential bias in data collection and outcome definition []
2.4 Data characteristics: Uses descriptive statistics to show data characteristics for different types of data (demographics and clinical measurements) []
2.5 Methods of data transformation and preprocessing applied: Imputation is discussed []
2.6 Known quality issues with the data: Missingness and outlier detection were discussed []
2.7 Sample size calculation: Brief section dedicated to power analysis []
2.8 Data availability: Explains how to obtain a copy of the data []

Methodology

3.1 Strategies for handling missing data: Describes how missing values were replaced []
3.2 Strategies for addressing class imbalance: Describes the approach of using SMOTE^c to adjust class ratios to address imbalance []
3.3 Strategies for reducing dimensionality of data: Describes the vectorization of a dimension of 100 into a 2D space using an established algorithm []
3.4 Strategies for handling outliers: The authors stated the threshold values used to detect outliers []
3.5 Strategies for data augmentation: Showed how variable similarity is achieved between synthetic and real data in the context of augmentation []
3.6 Strategies for model pretraining: Describes and illustrates how models from other data sets were trained and used in the new model []
3.7 The rationale for selecting the ML algorithm: Discusses properties of the selected algorithm relevant to the problem at hand as motivation []
3.8 The method of evaluating model performance during training: Presents a separate discussion of evaluation in cross-validation settings and external evaluation while also describing hyperparameter tuning []
3.9 The method used for hyperparameter tuning: Comprehensive description of tuning within nested cross-validation (this is a tutorial but illustrates how to describe the process) []
3.10 Model’s output adjustments: Describes the final model and how it was calibrated, and discusses the impact of embedding on patient data for interpretation []

Evaluation

4.1 Performance metrics used to evaluate the model: Comprehensive and detailed discussion of evaluation and quality metrics []
4.2 The cost or consequence of errors: Comprehensive error analysis []
4.3 The results of internal validation: Detailed validation discussion (internally and externally) []
4.4 The final model hyperparameters: Presents details of the final model and the winning parameters []
4.5 Model evaluation on an external data set: Detailed and comprehensive external validation that is separate from model testing []
4.6 Characteristics relevant for detecting data shift and drift: Implements performance monitoring, addresses data shifts over time, and illustrates them in detail []

Explainability and transparency

5.1 The most important features and how they relate to the outcomes: Presents variable importance (SHAP^d values) in the context of interpretation and compares it to existing literature []
5.2 Plausibility of model outputs: Shows sample output in the paper by Kendale et al []
5.3 Interpretation of a model’s results by an end user: Good discussion about interpretability and use of the final model []

^a ML: machine learning.

^b AUC: area under the curve.

^c SMOTE: synthetic minority oversampling technique.

^d SHAP: Shapley additive explanations.
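
To make checklist item 3.1 (strategies for handling missing data) more concrete, the following Python sketch shows one common way such a strategy can be implemented and reported. It is a minimal illustration using scikit-learn with made-up variable names and a median-imputation strategy; it is not drawn from any of the cited studies.

# Minimal, hypothetical sketch of documenting missing-data handling (CREMLS item 3.1).
# Variable names and the median-imputation strategy are illustrative assumptions.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [54, 61, None, 47],
    "creatinine": [1.1, None, 0.9, 1.4],
})

# Median imputation for continuous predictors; the chosen strategy and the
# proportion of missingness per variable are the kinds of details to report.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)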
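
Similarly, for item 3.2 (strategies for addressing class imbalance), the sketch below applies SMOTE, as named in the table, using the imbalanced-learn package on synthetic data; the class ratio, seed, and data are illustrative assumptions rather than the approach of the cited article.

# Hypothetical sketch of addressing class imbalance with SMOTE (CREMLS item 3.2).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic, heavily imbalanced binary classification data.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before resampling:", Counter(y))

# Oversample the minority class (on the training split only in a real study);
# the sampling strategy and random seed should be reported.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after resampling:", Counter(y_res))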
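
For item 3.9 (the method used for hyperparameter tuning), the following sketch shows tuning nested inside cross-validation, so that the reported performance reflects the entire tuning procedure; the estimator, grid, and data set are assumptions for illustration, not the method of the cited tutorial.

# Hypothetical sketch of hyperparameter tuning within nested cross-validation
# (CREMLS item 3.9).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: selects hyperparameters by grid search.
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=3,
)

# Outer loop: estimates the generalization performance of the whole tuning procedure.
outer_scores = cross_val_score(inner_search, X, y, scoring="roc_auc", cv=5)
print("nested CV AUC: mean =", outer_scores.mean(), "SD =", outer_scores.std())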
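
For item 5.1 (the most important features and how they relate to the outcomes), the sketch below computes SHAP values for a tree-based model and produces the kind of ranked feature summary the checklist asks authors to report; the model, the public data set, and the use of the shap package are illustrative assumptions.

# Hypothetical sketch of reporting feature importance with SHAP values (CREMLS item 5.1).
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer yields per-feature contributions for each prediction; a summary
# plot or ranked table of these values is what authors would report.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)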

We strongly advise authors who intend to submit manuscripts describing the development, evaluation, and application of ML models to the Journal of Medical Internet Research, JMIR AI, JMIR Medical Informatics, or other JMIR Publications journals to adhere to the CREMLS guidelines and checklist, to ensure that all relevant details of their work have been considered and addressed before the submission and review process begins. More complete, higher-quality reporting benefits authors by accelerating the review cycle and reducing the burden on reviewers; this is precisely why reporting guidelines and checklists are needed for papers describing prognostic and diagnostic ML studies. Adherence is expected to help, for example, by reducing missing documentation of model hyperparameters and by clarifying how data leakage was avoided (a minimal sketch of how both can be made explicit follows this paragraph). We have observed that peer reviewers have, in practice, been asking authors to improve reporting on the same topics covered in the CREMLS checklist. This is not a surprise, given that peer reviewers are experts in the field and would note important information that is missing. Nevertheless, we encourage reviewers to use the checklist regularly to ensure completeness and consistency.
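
As a minimal illustration of how hyperparameter documentation and leakage avoidance can be made auditable, the following Python sketch wraps preprocessing and the estimator in a single pipeline so that scaling is refit within each training fold only, and prints the selected hyperparameters so they can be reported verbatim. The estimator, grid, and data set are assumptions for illustration only, not a prescribed method.

# Hypothetical sketch: leakage-safe preprocessing and reportable hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# The scaler sits inside the pipeline, so it is refit on each training fold
# during cross-validation and never sees held-out data.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

print("selected hyperparameters:", search.best_params_)
print("held-out AUC:", search.score(X_test, y_test))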

The CREMLS checklist’s scope is limited to ML models using structured data that are trained and evaluated in silico and in shadow mode. This leaves a significant opportunity to extend CREMLS to other data modalities and to additional phases of model deployment. Should such extended reporting guidelines and checklists be developed, they may be considered for recommendation for submissions to JMIR Publications journals, incorporating lessons learned from this initial checklist for studies reporting the use of ML models.


Conclusion

There is evidence that the completeness of reporting of research studies is beneficial to the authors and the broader scientific community. For prognostic and diagnostic ML studies, many reporting guidelines have been developed, and these have been consolidated into CREMLS, capturing the combined value of the source guidelines and checklists in one place. In this Editorial, we extend journal policy and recommend that authors follow these guidelines when submitting articles to journals in the JMIR Publications portfolio. This will improve the reproducibility of research studies using ML methods, accelerate review cycles, and improve the quality of published papers overall. Given the rapid growth of studies developing, evaluating, and applying ML models, it is important to establish reporting standards early.

KEE and BM conceptualized this study and drafted, reviewed, and edited the manuscript. TIL and GE reviewed and edited the manuscript. WK prepared the literature summary and reviewed the manuscript.

KEE and BM are co–editors-in-chief of JMIR AI. KEE is the cofounder of Replica Analytics, an Aetion company, and has financial interests in the company. TIL is the scientific editorial director at JMIR Publications, Inc. GE is the executive editor and publisher at JMIR Publications, Inc, receives a salary, and owns equity.

Edited by T Leung; this is a non–peer-reviewed article. Submitted 04.04.24; accepted 04.04.24; published 02.05.24.

©Khaled El Emam, Tiffany I Leung, Bradley Malin, William Klement, Gunther Eysenbach. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 02.05.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
