Exudative or neovascular age-related macular degeneration (nAMD) is one of the leading causes of severe vision loss in older adults worldwide [1] and affects an estimated 2% of Europeans aged over 65 [2]. The potential impact on individuals’ and caregivers’ quality of life is profound [3]. In addition, nAMD places a significant burden on healthcare resources because of the need for ongoing monitoring and treatment with costly intravitreal anti-vascular endothelial growth factor (anti-VEGF) injections [4], and late diagnosis and delayed treatment initiation are associated with worse long-term visual outcomes [5]. The incidence of nAMD is projected to continue rising as the population ages, making early detection even more essential.
Advancements in artificial intelligence (AI) for retinal image analysis have the potential to improve patient outcomes by enabling earlier detection and more accurate diagnosis, and hence more timely intervention. This commentary discusses a recent Cochrane review evaluating the accuracy of AI tools in diagnosing nAMD [6].
The authors identified 36 eligible diagnostic test accuracy (DTA) studies published up to April 2024, which together evaluated 40 AI algorithms. Twenty studies, comprising 16,655 participants, analysed optical coherence tomography (OCT) scans, fundus images, infrared images, OCT angiography, or a mix of these; the total cohort size could not be determined because the remaining 16 studies did not report participant numbers. Demographics were also poorly reported: only four studies described participants’ age and sex, and none reported ethnicity. However, the study populations did span Asia, Europe, and the United States.
Twenty-eight algorithms were internally validated, demonstrating high accuracy with a summary sensitivity of 0.93 (95% confidence interval (CI) 0.89–0.96) and specificity of 0.96 (95% CI 0.94–0.98). A further three underwent external validation, demonstrating similarly strong performance: a pooled sensitivity of 0.94 (95% CI 0.90–0.97) and a high specificity with wider CIs (0.99, 95% CI 0.76–1.00). The remaining nine algorithms did not provide data suitable for meta-analysis.
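To ground these summary figures, the sketch below shows how sensitivity and specificity (with Wilson score intervals) are derived from a single study's confusion matrix. The counts are hypothetical, chosen only to land near the review's pooled point estimates; the review itself pooled estimates across studies with meta-analytic models rather than this single-study calculation.

```python
import math

def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - half, centre + half

# Hypothetical confusion-matrix counts for a single validation set
tp, fn = 186, 14   # nAMD cases: correctly flagged vs missed
tn, fp = 384, 16   # non-nAMD cases: correctly cleared vs over-called

sens, sens_ci = tp / (tp + fn), wilson_ci(tp, tp + fn)
spec, spec_ci = tn / (tn + fp), wilson_ci(tn, tn + fp)
print(f"Sensitivity {sens:.2f} (95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f})")
print(f"Specificity {spec:.2f} (95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```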
While these AI algorithms appear to perform promisingly well for diagnosing nAMD, the results should be interpreted with caution: internal validation on small datasets carries a risk of overfitting, and the certainty of evidence is low owing to imprecision (wide CIs) and risk of bias.
None of the studies were free of bias across the four domains of the modified Quality Assessment of Diagnostic Accuracy Studies‐2 (QUADAS‐2) tool. For example, multiple studies did not report the numbers and experience levels of the human graders setting the reference standard (16/36, 44%), did not describe masking or independence (35/36, 97%), or did not state whether graders were provided with clinical information (25/36, 69%). Given the subjective nature of image-based reference standards and the impact of labelling errors on apparent algorithmic performance [7], rigorous evaluations should involve at least two experienced graders and a formal arbitration process [8], one simple form of which is sketched below. It is also important to consider whether the reference standard should be based on expert(s) grading the same image as the AI model, or benchmarked against the clinical gold standard of fluorescein angiography and/or multimodal imaging. In several studies, the image graded was a single modality such as a fundus photograph, which would not be used alone in clinical practice for nAMD detection [9].
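The following is a minimal, illustrative sketch of a two-grader-plus-arbitration workflow; the specific policy shown is an assumption, not one prescribed by the review.

```python
from typing import Callable

def reference_label(grade_1: int, grade_2: int,
                    arbitrate: Callable[[], int]) -> int:
    """Reference standard from two independent, masked graders.

    Agreement is accepted as the label; disagreement is escalated to a
    senior arbitrator who regrades the image (still masked to AI output).
    """
    if grade_1 == grade_2:
        return grade_1
    return arbitrate()

# Example: two graders disagree on an OCT scan, so the arbitrator decides
label = reference_label(1, 0, arbitrate=lambda: 1)  # 1 = nAMD present
```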
Study design was another area of concern. The prevalence of nAMD across the studies was artificially high (33%, range 0.3–49%), as the majority (31/36, 86%) employed a case-control design, most of which compared patients with and without nAMD rather than nAMD versus a spectrum of other retinal diseases with potentially similar imaging features. The former presents a distinctly less complex task than is typically seen in clinical practice. In addition, strict eligibility criteria and the exclusion of patients with additional ocular conditions or diagnostic uncertainty may produce unrealistically optimistic results, as such “clean” datasets do not reflect real-world diagnostic challenges. This is especially true for the use case in which these models could deliver the greatest value: nAMD detection by non-specialists managing populations with a wide variety of complaints. This misalignment between datasets and the intended implementation niche invites spectrum bias and risks inflating apparent AI performance.
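To make the prevalence point concrete: even taking the pooled internal-validation sensitivity and specificity at face value, positive predictive value falls sharply as prevalence drops from case-control-enriched levels toward those plausible in screening settings. A minimal sketch (the prevalence figures are illustrative, chosen to bracket the range reported across the included studies):

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value via Bayes' theorem."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.93, 0.96  # the review's internal-validation summary estimates
for prev in (0.33, 0.05, 0.003):  # enriched dataset vs plausible clinic/community levels
    print(f"prevalence {prev:>5.1%}: PPV {ppv(sens, spec, prev):.1%}")
# prevalence 33.0%: PPV ~92%; 5.0%: ~55%; 0.3%: ~7%
```

And because sensitivity and specificity can themselves shift with case mix, real-world degradation may be greater still.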
Overall, this review highlights a clear need for improved reporting of diagnostic accuracy studies. Reporting standards covering diagnostic accuracy studies (Standards for Reporting of Diagnostic Accuracy Studies, STARD) and AI studies (Minimal Information about Clinical Artificial Intelligence Modelling, MI-CLAIM) are already well established [10, 11]. The forthcoming STARD‐AI extension [12] may help to improve reporting, but will require active adoption by journal editors and other stakeholders, given the limited compliance with existing tools.
This Cochrane review also surfaced inadequate reporting of sociodemographic characteristics across multiple studies, which limits our understanding of how the models perform across diverse patient populations. The MINimum Information for Medical AI Reporting (MINIMAR) standards recommend reporting demographic variables including, at a minimum, age, sex, race, ethnicity, and socioeconomic status [13]. More recently, the STANDING Together collaboration has developed international consensus recommendations to highlight and/or mitigate bias in datasets used to develop and validate AI models [14]. These emphasise not only reporting relevant patient metadata but also evaluating AI performance across patient subgroups: beyond aggregate performance, it is essential to assess whether an AI is ‘safe on average, or safe for all’ [15].
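In practice, this means stratifying the standard accuracy metrics by whatever metadata are available, rather than reporting a single aggregate figure. A minimal sketch with hypothetical predictions and a single demographic variable:

```python
import pandas as pd

# Hypothetical per-eye predictions with patient metadata attached
df = pd.DataFrame({
    "label":     [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],   # 1 = nAMD on reference standard
    "pred":      [1, 0, 0, 1, 0, 1, 1, 0, 1, 0],   # 1 = nAMD per the AI model
    "ethnicity": ["A", "A", "A", "B", "B", "B", "B", "A", "B", "A"],
})

# 'Safe on average, or safe for all': report per-subgroup sensitivity,
# not just the aggregate figure
for group, sub in df.groupby("ethnicity"):
    cases = sub[sub.label == 1]
    sens = (cases.pred == 1).mean()
    print(f"{group}: sensitivity {sens:.2f} (n cases = {len(cases)})")
```

Small subgroups will, of course, yield imprecise estimates, which is itself an argument for recruiting and reporting diverse study populations.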
While the lack of external validation studies is concerning, real-world applicability should extend beyond the simple dichotomy of internal versus external validation. Future studies should consider real-world deployment designs such as silent trials (also known as translational trials) [16], randomised controlled trials, or prospective deployment studies with adequate safety guardrails to evaluate algorithm performance in clinical environments. Beyond diagnostic accuracy, such evaluations should incorporate human-computer interaction and patient-centred outcomes to capture the system-wide impact of AI models on healthcare services and to build robust evidence for clinical utility and feasibility.
Several key considerations fall outside the scope of this Cochrane review. Should an AI model for diagnosing nAMD function autonomously, or serve as a decision support tool for clinicians? Should it triage symptomatic patients in primary care and remote settings, where access to specialist care is more limited? What value can these models offer in well-resourced secondary care settings? Such considerations have important implications for the evidence needed to support regulatory approval, and for stakeholders such as payers and policymakers as they design reimbursement structures that allow AI developers to deliver patient benefit sustainably.
This Cochrane review highlights the potential of AI to transform current paradigms of nAMD detection. It also exposes significant gaps in the current evidence base, including inadequate reporting and a lack of external validation and real-world evaluation. Addressing these gaps will require robust study designs, adherence to reporting standards, and greater clarity on how diagnostic AI fits into the clinical workflow. These are essential steps towards bridging the “AI chasm” [17] and developing early signals of efficacy into products that can be integrated into routine clinical practice to deliver scalable benefit to patients and healthcare services.