Performance of deep-learning artificial intelligence algorithms in detecting retinopathy of prematurity: A systematic review
Amelia Bai1, Christopher Carty2, Shuan Dai3
1 Department of Ophthalmology, Queensland Children's Hospital, Brisbane; Centre for Children's Health Research, Brisbane; School of Medical Science, Griffith University, Gold Coast, Australia
2 Griffith Centre of Biomedical and Rehabilitation Engineering (GCORE), Menzies Health Institute Queensland, Griffith University Gold Coast; Department of Orthopaedics, Children's Health Queensland Hospital and Health Service, Queensland Children's Hospital, Brisbane, Australia
3 Department of Ophthalmology Queensland Children's Hospital, Brisbane; School of Medical Science, Griffith University, Gold Coast; University of Queensland, Australia
Correspondence Address:
Dr. Shuan Dai
Level 7d Surgical Directorate, 501 Stanley Street, South Brisbane QLD 4104
Australia
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/sjopt.sjopt_219_21
PURPOSE: Artificial intelligence (AI) offers considerable promise for retinopathy of prematurity (ROP) screening and diagnosis. The development of deep-learning algorithms to detect the presence of disease may contribute to adequate screening, early detection, and timely treatment for this preventable blinding disease. This review aimed to systematically examine the literature on AI algorithms for detecting ROP. Specifically, we focused on the performance of deep-learning algorithms through sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for both the detection and grading of ROP.
METHODS: We searched Medline OVID, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. Studies evaluating the diagnostic performance of deep-learning models based on retinal fundus images with expert ophthalmologists' judgment as reference standard were included. Studies which did not investigate the presence or absence of disease were excluded. Risk of bias was assessed using the QUADAS-2 tool.
RESULTS: Twelve of the 175 studies identified were included. Five studies measured the performance of detecting the presence of ROP and seven studies determined the presence of plus disease. The average AUROC across the 11 studies reporting it was 0.98. The average sensitivity and specificity were 95.72% and 98.15%, respectively, for detecting ROP, and 91.13% and 95.92%, respectively, for detecting plus disease.
CONCLUSION: The diagnostic performance of deep-learning algorithms in published studies was high. Few studies presented externally validated results or compared performance to expert human graders. Large-scale prospective validation alongside robust study design could improve future studies.
Keywords: Artificial intelligence, deep learning, diagnosis, retinopathy of prematurity, screening
The concept of artificial intelligence (AI) dates back to the 1950s, when Alan Turing first discussed how to build and test intelligent machines in the paper “Computing Machinery and Intelligence.”[1] It was not until 1956, however, at the seminal Dartmouth Summer Research Project on AI, that John McCarthy officially coined the term AI. This conference introduced a computer program designed to mimic the problem-solving skills of a human, catalyzing the next 20 years of AI research.[2] Today, AI is incorporated into many applications of day-to-day life, including speech recognition, photo captioning, language translation, robotics, and even self-driving cars.[3],[4],[5] These applications are made possible by deep learning, an advanced form of AI that self-learns from large training sets to program itself to perform certain tasks.[6] The application of AI has gained popularity in the medical diagnostic field, and promising outcomes have resulted from deep-learning screening algorithms in ophthalmology.
There has been particular success in AI screening for diabetic retinopathy, with several groups reporting deep-learning algorithms detecting diabetic retinopathy at sensitivities of 83%–90% and specificities of 92%–98%.[7],[8] Moreover, the successful validation of these algorithms has seen progression to “real-world” implementation of screening programs through prospective evaluation. One such study produced a sensitivity of 83.3% and specificity of 92.5% in detecting referable diabetic retinopathy in a prospective evaluation.[8] Similarly promising results are being reported by many other groups applying deep learning to the diagnosis of other ophthalmic conditions, including diabetic macular edema,[9] age-related macular degeneration,[10] glaucoma,[11] and retinopathy of prematurity (ROP).[12],[13]
ROP is a retinal vascular proliferative disease affecting premature infants whose diagnosis depends on timely screening. Globally, it is estimated that at least 50,000 children are blind from ROP,[14] and it remains the leading cause of preventable childhood blindness.[15] Advances in retinal imaging mean the disease is now easily identifiable on retinal photographs, making it a strong candidate for deep learning. As survival rates of premature infants continue to increase with medical advances,[16] the demand for ROP screening is rapidly exceeding the capacity of available specialist ophthalmologists. For this reason, reports of deep-learning models matching or exceeding human experts in ROP diagnostic performance have generated considerable interest. It remains fundamental, however, that this enthusiasm does not override the need for critical appraisal, as a missed diagnosis of ROP can result in significant sequelae such as blindness. Therefore, any deep-learning screening algorithm will need to show high diagnostic performance, high sensitivity, generalizability, and applicability to the real-world setting. In anticipation of deep-learning diagnostic tools being implemented into clinical practice, it is judicious to systematically review the body of evidence supporting AI screening for ROP. This systematic review aims to critically appraise the current state of diagnostic performance of deep-learning algorithms for ROP screening, with particular consideration of study design, algorithm development, type of validation, performance compared to clinicians, and diagnostic accuracy.
Methods
Search strategy and selection criteria
Studies that developed or validated a deep-learning model for the diagnosis of ROP and compared the accuracy of algorithm diagnoses to ROP experts were included in this systematic review. We searched MEDLINE-Ovid, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. The full search strategy for each database is available in [Appendix 1]. The cutoff of January 1, 2012, was prespecified based on the breakthrough in deep-learning approaches marked by the model AlexNet.[17] The search was first performed on July 10, 2020, revised on May 23, 2021, and updated on September 20, 2021. Manual searches of bibliographies and citations from included studies were also completed to identify any additional articles potentially missed by the searches.
Eligibility assessment was conducted by two reviewers who independently screened the titles and abstracts of search results. Only studies that used AI algorithms to identify the presence of the disease of interest, ROP, were included. We accepted standard-of-care diagnosis, expert opinion, or consensus as adequate reference standards to classify the absence or presence of disease. We excluded studies that did not test diagnostic performance or that investigated the accuracy of image segmentation rather than disease classification. Studies that assessed the ability to classify disease severity were accepted if they incorporated primary results of disease detection. Review articles, conference abstracts, and studies that presented duplicate data were excluded. We assessed the risk of bias in patient selection, index test, reference standard, and flow and timing of each study using QUADAS-2.[18] The full assessment of bias can be found in [Appendix 2].
This systematic review was completed following the recommendations of the Preferred Reporting Items for Systematic reviews and Meta-Analyses[19] statement and the research question was formulated according to the CHARMS[20] checklist for systematic reviews of prediction models. Methods of analysis and inclusion criteria were specified in advance.
Data analysis
Data were extracted independently by two reviewers (AB and SD) using a predefined data extraction sheet, followed by cross-checking. Any discrepancies were discussed with a third reviewer (CC). Demographics and sample size (gestational age [GA], birth weight, number of participants, and number of images), data characteristics (data source, inclusion and exclusion criteria, and image augmentation), algorithm development (architecture, transfer learning, and number of images for training and tuning), algorithm validation (reference standard, number of experts, same method for assessing reference standard, and internal and external validation), and results (sensitivity, specificity, area under the receiver operating characteristic curve [AUROC] for the algorithm, human graders, and external validation if applicable) were sought. Two papers produced different algorithms from different data sets or with different identification tasks and were therefore recorded as separate algorithms in the results section.[21],[22] Data from all 12 papers were included and any missing information was recorded. Where sensitivity and specificity were not explicitly reported but could be calculated from a confusion matrix, the calculated results were included.
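Where only a confusion matrix was published, sensitivity and specificity can be derived directly from its cell counts. A minimal sketch of that derivation (the counts below are illustrative, not taken from any included study):

```python
def sens_spec(tp, fn, fp, tn):
    """Derive sensitivity and specificity from 2x2 confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true-positive rate among diseased eyes
    specificity = tn / (tn + fp)  # true-negative rate among non-diseased eyes
    return sensitivity, specificity

# Illustrative counts only (not from any included study):
sens, spec = sens_spec(tp=90, fn=10, fp=5, tn=95)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```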
Results
Our search identified 175 records, of which 99 were screened [Figure 1]. Thirty full-text articles were assessed for eligibility and 12 studies were included in the systematic review.[12],[13],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31] The remaining 18 full-text articles were excluded due to no test of diagnostic performance,[32],[33],[34],[35],[36],[37],[38],[39] no classification task,[40],[41],[42] no internal validation,[23],[43] no AI algorithm,[44] or not being based on standard clinical care.[45]
Data characteristics and demographics
All 12 studies used retrospective images obtained as part of routine clinical care or from local screening programs. Seven of these studies collected images from China,[22],[24],[25],[26],[28],[29],[31] one from India, one from North America,[12] one from sites in America and Mexico,[30] one from America and Nepal,[21] and one study included images from New Zealand.[13] Image collection dates across the studies ranged from July 2011 to June 2020. Three studies specified their inclusion criteria[25],[26],[31] and five other studies specified their exclusion criteria.[12],[13],[21],[28],[29] Poor-quality images were excluded in five studies[12],[13],[28],[29],[31] and image augmentation occurred in seven studies.[13],[21],[25],[27],[28],[29],[30] These characteristics are summarized in [Table 1]. Seven studies recorded demographic information,[21],[24],[25],[26],[27],[29],[31] of which the mean GA was 30.9 weeks and mean birth weight was 1501.25 g. A total of 178,459 images were used across all 12 studies, ranging from 2668 to 52,249 images per study. Five studies formulated an algorithm to detect ROP[21],[22],[24],[25],[31] and seven studies created an algorithm to detect the presence of plus disease out of a total of 5358 plus disease images.[12],[13],[26],[27],[28],[29],[30] Full details of demographics and sample size are found in [Table 2].
Table 2: Patient demographics and sample size for the 12 included studies
Algorithm development and validation
Convolutional neural networks formed the basis for the algorithms developed in all 12 studies. A variety of algorithms were utilized for transfer learning, including ResNet, ImageNet, U-Net, and VGG-16 [Table 3], whereas one study did not use a transfer-learning approach.[25] The majority of studies used <6000 images to train their algorithm; however, five studies utilized >10,000 images for algorithm development.[22],[25],[28],[29],[31] The reference standard across all 12 studies was based on disease diagnosis by 1–5 expert graders, with an average of 2.6 human graders agreeing upon each image per study. A variety of internal validation methods were recorded, including random split-sample validation and cross-validation [Table 4]. Five studies[12],[13],[21],[22],[30] obtained external validation of their AI algorithms, of which one study completed a prospective evaluation of algorithm performance.[22]
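Random split-sample validation, one of the internal validation methods recorded, partitions the available images so that a held-out subset never influences training. A minimal sketch (the 70/15/15 split and placeholder image names are illustrative assumptions, not those of any included study):

```python
import random

def split_sample(images, train_frac=0.7, tune_frac=0.15, seed=42):
    """Randomly partition an image list into training, tuning, and test sets.
    The held-out test set provides the internal validation estimate."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = images[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_tune = int(len(shuffled) * tune_frac)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]
    return train, tune, test

# Illustrative: 1000 placeholder image identifiers.
train, tune, test = split_sample([f"img_{i}.png" for i in range(1000)])
print(len(train), len(tune), len(test))
```

Cross-validation, the other method recorded, instead rotates which partition is held out so every image is tested exactly once.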
Algorithm performance
The performance of each algorithm is listed in [Table 5]. Five studies recorded the ability of their algorithm to detect the presence of ROP, of which the average AUROC was 0.984.[21],[22],[24],[25],[31] Sensitivity and specificity were recorded in four of those studies and averaged 95.72% and 98.15%, respectively.[22],[24],[25],[31] One study compared human grader performance to the AI algorithm, revealing similar sensitivities (94.1% AI, 93.5% human) and specificities (99.3% AI, 99.5% human) for ROP diagnosis.[31] Two of the five studies underwent external validation, revealing an average sensitivity and specificity of 60% and 88.3%, respectively, for detecting the presence of disease.[21],[22] The seven other studies determined the ability of their algorithm to detect the presence of plus disease. Among these, six studies measured AUROC, for which the average was 0.98.[12],[13],[26],[27],[29],[30] The average sensitivity and specificity for detecting plus disease recorded from six studies were 91.13% and 95.92%, respectively.[12],[13],[26],[27],[28],[29] External validation occurred in two of these studies and produced an average sensitivity of 93.45% and specificity of 87.35%.[12],[13] Performance of the AI algorithms at detecting pre-plus disease was measured in two articles, producing an average sensitivity of 96.2% and specificity of 95.7%.[12],[26] By comparison, four studies measured performance in determining the stage of ROP, showing an average sensitivity and specificity of 89.07% and 94.63%, respectively.[22],[25],[28],[29]
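The AUROC values above summarize how well an algorithm's continuous output separates diseased from non-diseased eyes; it equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A minimal rank-based (Mann–Whitney) computation, using made-up labels and scores rather than data from any included study:

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count half) -- the Mann-Whitney U formulation."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = disease present, 0 = disease absent.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auroc(labels, scores))
```

A value of 1.0 would indicate perfect separation, while 0.5 is no better than chance, which is why the reported averages near 0.98 represent strong discrimination.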
Discussion
We found that deep-learning algorithms for ROP screening demonstrated sensitivity and specificity metrics comparable to those of neural network algorithms in diabetic retinopathy.[46] Although this estimate supports the potential for deep-learning algorithms to be implemented as real-world diagnostic tools, several methodological deficiencies were common across the included studies and need to be considered. These include the quality of the reference standard, use of sample size calculations, external validation, definition of the presence or absence of disease, and the need for prospective evaluation.
First, we found variability in specific algorithm diagnostic targets, with the 12 papers split between diagnosing the presence of ROP as a whole and the presence of plus disease. It is important to differentiate these diagnostic targets as the clinical implications of the findings differ. In addition, most studies utilized a reference standard graded by on average 2–3 experts, with only one study producing a reference standard diagnosed by five clinicians per image.[31] It is well reported that there is a significant amount of intergrader variability in ROP diagnosis due to its subjective nature;[47],[48] therefore, caution is needed in recognizing the potential for grader bias in studies utilizing only a few expert graders.
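Intergrader variability of the kind cited here is commonly quantified with Cohen's kappa, the chance-corrected agreement between two graders labelling the same images. A small sketch with hypothetical grades (not data from any included study):

```python
def cohens_kappa(grader_a, grader_b):
    """Cohen's kappa for two graders labelling the same images
    (1 = disease present, 0 = absent): observed agreement corrected
    for the agreement expected by chance alone."""
    n = len(grader_a)
    observed = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    p_a1 = sum(grader_a) / n  # proportion grader A labels positive
    p_b1 = sum(grader_b) / n  # proportion grader B labels positive
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

# Hypothetical grades for ten images from two graders.
a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print(round(cohens_kappa(a, b), 2))
```

Reporting such an agreement statistic alongside the reference standard would make the reliability of each study's ground truth easier to appraise.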
Second, there was large variation in the number of images used to train each algorithm, ranging from 289[27] to 39,029 images.[29] Convolutional neural networks learn by computing the error between the machine's output and the image diagnosis; hence, the more images used to train a machine, the smaller the error of its diagnostic output.[6] For this reason, the studies with sample sizes in the tens of thousands were likely to have more reliable results than those trained on hundreds or thousands of images. Nonetheless, no studies reported formal sample size calculations to ensure sufficient sizing. Despite the challenge of sample size calculations in the context of AI algorithms, they remain a principal component of any study design, and only one paper reported sample size as a limitation.[25] Future studies should consider formulating sample size calculations to justify the number of images required for algorithm design.
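One established approach future studies could adapt is Buderer's formula for diagnostic accuracy studies, which converts a target sensitivity, desired precision, and expected disease prevalence into a required sample size. A sketch with illustrative parameters (none drawn from the included studies):

```python
import math

def n_for_sensitivity(sens, precision, prevalence, z=1.96):
    """Buderer's formula: participants needed to estimate sensitivity
    within +/- `precision` at ~95% confidence (z = 1.96), given the
    expected disease prevalence in the screened population."""
    n_diseased = (z ** 2) * sens * (1 - sens) / precision ** 2
    return math.ceil(n_diseased / prevalence)

# Illustrative: expected sensitivity 95%, +/-5% precision, 20% prevalence.
print(n_for_sensitivity(sens=0.95, precision=0.05, prevalence=0.20))
```

This sizes the test set for a reliable accuracy estimate; the number of images needed to *train* an algorithm is a separate, harder question, which is part of why formal calculations are rarely reported in AI studies.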
Third, exclusion of poor-quality images or image augmentation may affect how these deep-learning algorithms perform in the real-world clinical setting. High-quality images correlate with high-quality diagnoses and smaller algorithm errors,[6] so it is understandable that most papers exclude poor-quality images; however, it is important to keep this within reason. The quality of images used to train an algorithm should correspond to the quality of images taken in the clinical setting so that measured performance reflects real-life performance. It is also for this reason that external validation of an algorithm, using an image set outside of the training image set, is crucial to determine the generalizability of a study. Only five of the 12 studies completed external validation; all but one, which showed equivalent performance, revealed inferior algorithm performance compared with their internal test set. This finding highlights the need for out-of-sample external validation of these screening algorithms to better understand how they will perform in the clinical setting.
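Image augmentation, reported in seven of the studies, expands a training set with label-preserving transforms of existing images. A dependency-free sketch on a tiny pixel grid (the specific transforms are common examples, not those used by any particular study):

```python
def augment(image):
    """Return simple label-preserving variants of a 2D pixel grid:
    the original, a horizontal flip, and a 180-degree rotation."""
    hflip = [row[::-1] for row in image]             # mirror left-right
    rot180 = [row[::-1] for row in image[::-1]]      # rotate half-turn
    return [image, hflip, rot180]

# Tiny 2x3 "image" of pixel intensities.
img = [[1, 2, 3],
       [4, 5, 6]]
for variant in augment(img):
    print(variant)
```

Because augmented variants are derived from the same source images, they enlarge the effective training set without adding new patients, which is why augmentation cannot substitute for external validation on genuinely new data.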
Fourth, the ground truth or reference standard labels were mostly derived from data collected for other purposes, such as databases of ROP images or retrospective routine clinical care notes. Although there is an internationally accepted guideline for defining the presence and stage of ROP, the International Classification of Retinopathy of Prematurity revisited (ICROP)[49] (more recently updated in a 2021 version[50]), only five studies specifically mentioned the ICROP in their methods for defining the reference standard. As the ICROP is the universally adopted diagnostic criteria for grading ROP, it is reasonable to assume that the other seven studies also used these guidelines; however, the criteria for the presence or absence of disease should always be clearly defined in AI studies.
Finally, only one study completed a prospective evaluation of its algorithm, a process that is vital for assessing real-world performance. The majority of studies assessed deep-learning diagnostic accuracy in isolation, without external validation, as mentioned earlier, or comparison to experts. Only three studies provided a comparison of AI performance with human performance, allowing evaluation of real-world application. Without comparison of AI to human performance, the results of the remaining studies are limited in their ability to be extrapolated into health-care delivery. For a deep-learning diagnostic tool to be applicable in clinical bedside screening, it must perform comparably to or better than the gold standard, in this case expert diagnosis. More work is required to validate the performance of AI algorithms against human graders, ideally using the same external test dataset.
It is clear from this systematic review that there is still no well-designed, randomized, head-to-head comparison of an effective, externally validated AI algorithm against human performance in real time. A study of this magnitude could reveal the clinical implications of an algorithm implemented in the clinical setting. For this reason, prospective evaluations of these deep-learning diagnostic tests are crucial to unveil the full potential of AI in both diagnostic and therapeutic medicine. We recognize that there is a large “black box” issue in deep learning, whereby the image features learned by an algorithm are unknown to the user.[6] It is for this reason that many clinicians are sceptical about entrusting clinical care to AI, especially when the clinical features clinicians are familiar with may not be the same features used by an algorithm. This further emphasizes the need for well-executed studies that minimize bias and are thoroughly and transparently reported. Most of the concerns we have highlighted in this review are avoidable with robust design, and it remains critical that these AI diagnostic tests are evaluated in the context of their intended clinical pathway.
Conclusion
AI has been heralded as a revolutionary technology for many industries, and deep-learning algorithms for the diagnosis of ROP are no exception. Despite the issues we have highlighted in this systematic review, the performance of the 12 deep-learning algorithms evaluated has been extremely high, with every study that reported an AUROC achieving 0.94 or above. These results outline the ability of AI algorithms to perform comparably to, or better than, human experts and provide the groundwork for future large-scale prospective studies. Although there are clear screening and treatment guidelines, ROP disease burden continues to rise as increased survival of preterm infants coincides with advancements in medical care.[15] The inadequate accessibility and number of experienced ophthalmologists continue to limit ROP screening and diagnosis. Consequently, the burden of ROP visual impairment is expected to increase unless a novel strategy such as deep-learning diagnostic algorithms becomes available. There is no doubt that the successful application of AI in ROP will revolutionize disease diagnosis through its high predictive performance and streamlined efficiency. The clinical implications of implementation into real-world clinical practice are considerable, with translation into highly accessible, high-quality, timely screening and a significant reduction in screening costs. AI will therefore become ubiquitous and indispensable for ROP screening, and it is important that high-quality research continues to aid the translation of this transformative technology in order to reduce the incidence of visual loss and blindness from this preventable disease.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References