In this study, we aimed to develop and validate a generalised glaucoma screening model. The best-performing model (vgg19_bn) achieved promising accuracy, highlighting the potential of our approach to transform the landscape of glaucoma screening. To the best of our knowledge, no prior study has used such extensive glaucomatous fundus data drawn from diverse demographics and ethnicities, with images captured by various fundus cameras at different resolutions. Most deep-learning-based models were trained on a limited dataset of healthy and glaucomatous fundus images from a single institution, which made them non-generalisable to other populations and settings [10]. Our training dataset included 7,498 glaucoma cases and 10,869 healthy cases gathered from 19 different datasets. This dataset, one of the most extensive collections of fundus images ever used to develop a generalised glaucoma screening model, represents a wide range of ethnic groups and fundus cameras, which could improve our model's performance and make it more applicable globally. The best-performing model was selected out of 20 pre-trained models (eFig. 2 in Supplementary), since choosing the right deep-learning architecture for a specific task is extremely important [38]. Pairing this extensive and diverse dataset with a well-suited deep-learning architecture can enhance the model's generalisability, making it a versatile and practical tool for glaucoma screening in diverse populations.
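As an illustration of this selection step, the following is a minimal sketch of benchmarking several ImageNet-pretrained architectures on a common validation split, assuming a fastai workflow (consistent with the ImageClassifierCleaner tool used later for data cleaning); the data path, epoch count and the three candidate architectures shown are placeholders for the study's actual 20-model comparison.

```python
# Hypothetical architecture sweep: fine-tune each candidate briefly and
# keep the one with the best validation accuracy.
from fastai.vision.all import ImageDataLoaders, Resize, vision_learner, accuracy
from torchvision.models import vgg19_bn, densenet201, resnet101

# Placeholder path to a folder of class-labelled fundus images.
dls = ImageDataLoaders.from_folder("fundus_data", valid_pct=0.2,
                                   item_tfms=Resize(224))

candidates = {"vgg19_bn": vgg19_bn,
              "densenet201": densenet201,
              "resnet101": resnet101}   # the study compared 20 such models

results = {}
for name, arch in candidates.items():
    learn = vision_learner(dls, arch, metrics=accuracy)  # ImageNet weights
    learn.fine_tune(5)                                   # epoch count is illustrative
    results[name] = float(learn.validate()[1])           # validation accuracy

best = max(results, key=results.get)
print(f"Best architecture: {best} ({results[best]:.3f})")
```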
Our best-performing model exhibited exceptional discriminative ability between glaucomatous and healthy discs, indicating that it learned glaucomatous features from heterogeneous data. The vgg19_bn attained a high AUROC of 99.2%, exceeding the reported performance of ophthalmologists (82.0%) and a previous deep-learning system (97.0%) [39], demonstrating its potential for practical use in glaucoma screening. Li et al. trained and validated their model on 31,745 and 8,000 fundus images, respectively [40]. Their model performed exceptionally well, achieving an AUC of 0.986 with a sensitivity of 95.6%, a specificity of 92.0% and an accuracy of 92.9% for identifying referable glaucomatous optic neuropathy. Our model maintained balance across all performance metrics, as shown in Table 2, for both glaucoma and healthy cases. The model was unbiased towards either class, making it a reliable screening tool for wider glaucoma populations.
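For completeness, the following is a small, self-contained sketch of how such headline metrics (AUROC, sensitivity, specificity, accuracy) can be computed from held-out predictions with scikit-learn; the label and probability arrays are toy placeholders, not the study's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # 1 = glaucoma, 0 = healthy (toy data)
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])  # predicted P(glaucoma)

auroc = roc_auc_score(y_true, y_prob)
tn, fp, fn, tp = confusion_matrix(y_true, (y_prob >= 0.5).astype(int)).ravel()
sensitivity = tp / (tp + fn)        # true-positive rate on glaucoma cases
specificity = tn / (tn + fp)        # true-negative rate on healthy cases
accuracy = (tp + tn) / len(y_true)
print(f"AUROC {auroc:.3f}  Sens {sensitivity:.3f}  Spec {specificity:.3f}  Acc {accuracy:.3f}")
```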
Furthermore, we implemented the DenseNet201, ResNet101 and DenseNet161 architectures (eTable 2 in Supplementary). The DenseNet201 demonstrated a classification accuracy of 96%, with an AUROC of 99%. Steen et al. employed the same DenseNet201 architecture, but their model achieved an accuracy of 87.60%, precision of 87.55%, recall of 87.60% and an F1 score of 87.57% on publicly available datasets containing 7,299 non-glaucoma and 4,617 glaucoma images [41]. A unique strength of our study is its balance between sensitivity and specificity, evident from our model's high AUROC values, which is a significant advantage in real-world clinical settings. Many previous models had difficulty maintaining this balance, resulting in high false-positive or false-negative rates [16, 17, 42].
Liu et al. trained a CNN algorithm for automatically detecting glaucomatous optic neuropathy using a massive dataset of 241,032 images from 68,013 patients, and the model's performance was impressive [43]. However, the model struggled with multiethnic data (7,877 images) and images of varying quality (884 images), showing drops in AUC of 7.3% and 17.3%, respectively. In contrast, our model demonstrated a modest decline in accuracy of ~9.6% when tested on the DRISHTI-GS dataset. We suspect that part of this performance shift is due to inconsistencies and the lack of a clearly defined protocol for glaucoma classification across the publicly available datasets. We discovered specific variances in the classification criteria for glaucoma within the datasets (Fig. 2), which may have contributed to this drop in accuracy. Despite this, the model's accuracy remained high, indicating strong generalisation capability. Nevertheless, evaluating the model's performance across additional datasets would further confirm its reliability and generalisability.
Investigating our model's top losses, we gained two significant insights. First, the model did not perform well on borderline cases, suggesting a need for advanced training techniques to handle such intricacies. Second, we identified potential mislabelling of fundus images within our dataset. Such mislabelling could introduce confusion during the model's learning phase, thereby decreasing performance. Both findings highlight the need for robust data quality checks and expert verification during dataset preparation. To improve the generalisability of a CNN model for glaucoma screening, priority should be given to accurate labelling of training data by glaucoma experts based on clinical tests, rather than simply expanding fundus data from multiethnic populations. Only five datasets (ACRIMA, REFUGE1, LAG, KEH and PAPILA) classify glaucoma based on comprehensive clinical examinations. In contrast, the remaining datasets used in this study either lacked detailed information regarding their study-specific classification of glaucomatous fundus images, or the diagnosis was based solely on fundus images. Given that accurate data labelling based on clinical tests is essential for training a robust model, this heterogeneity of disease classification may have affected our model. Additional work profiling the validity and accuracy of these datasets is required.
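The top-loss inspection described above can be reproduced with fastai's interpretation API, consistent with the fastai-based cleaning step reported in this study; `learn` denotes the trained Learner and is assumed to exist.

```python
from fastai.vision.all import ClassificationInterpretation

interp = ClassificationInterpretation.from_learner(learn)  # `learn` = trained model
# Show the images the model was most confidently wrong about: borderline
# discs and suspected mislabels surface here for expert review.
interp.plot_top_losses(16, nrows=4)
interp.plot_confusion_matrix()
```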
We explored the decision-making process of our deep-learning model by employing Grad-CAM to create heatmaps for the input fundus images. These heatmaps highlight the regions of a fundus image that the model relied on when determining the presence or absence of glaucoma. Interestingly, the model's areas of emphasis align well with those that ophthalmologists typically examine, such as the optic disc and cup, strengthening the clinical relevance of our model. These visual insights add a layer of transparency to our deep-learning model and provide a key link between automated classification and clinical understanding. Ensuring in this way that the model's decision-making process correlates with the clinical indicators of glaucoma can build clinicians' trust in such algorithms, enabling wider adoption in clinical practice.
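As a concrete illustration, the sketch below generates a Grad-CAM heatmap with the open-source pytorch-grad-cam package; the paper does not specify which implementation was used, and the ImageNet-initialised vgg19_bn, the chosen target layer and the random input tensor are placeholders for the trained model and a real fundus image.

```python
import torch
from torchvision.models import vgg19_bn
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = vgg19_bn(weights="IMAGENET1K_V1").eval()  # stand-in for the trained model
target_layers = [model.features[-1]]              # last convolutional stage

cam = GradCAM(model=model, target_layers=target_layers)
input_tensor = torch.randn(1, 3, 224, 224)        # placeholder fundus image
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(0)])[0]  # class index is illustrative
# `heatmap` is an HxW saliency map that can be overlaid on the fundus image
# to show which regions (e.g. the optic disc and cup) drove the prediction.
```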
Although our study demonstrates promising results, there are several limitations. First, our dataset contained mislabelled fundus images, which could impact the model's learning process and accuracy. We employed a data-cleaning procedure to address this, removing 1,306 images using ImageClassifierCleaner [30]. This produced a cleaner and more reliable dataset, on which we re-trained our model; the refinement considerably enhanced the model's robustness and improved its generalisation to unseen data. Second, we observed that class imbalance could reduce the model's effectiveness; we utilised class-weight balancing techniques to address this. Furthermore, data augmentation was used in the training phase, which could yield images that differ from actual clinical images. Third, our Grad-CAM heatmaps indicated that the model occasionally focused on non-relevant regions when making classification decisions, implying that it might be learning from noise or artefacts within the images; in most cases, however, the heatmaps confirmed that the model based its predictions on clinically interpretable features. Fourth, there was a substantial imbalance in the dataset (eTable 3 in Supplementary): Non-Caucasians represented the majority of healthy cases (76.4%), Caucasians accounted for nearly half of the glaucoma cases (47.4%) and Hispanics were underrepresented. This may affect the model's performance and generalisability across different populations. We also acknowledge that no publicly available glaucomatous fundus data from the African continent (eFig. 1 in Supplementary) were included in our model's training and validation; incorporating glaucoma datasets from African countries could further enhance the model's generalisability, especially in under-resourced areas. Finally, our model's external validation was conducted solely on the DRISHTI-GS dataset. Future studies should aim to validate the model across multiple datasets, diverse populations and varied imaging devices to ensure broader applicability. Additionally, our model did not integrate clinical data, such as patients' glaucoma history, IOP measurements or visual field data, which could further enhance its predictive capabilities. Despite these limitations, the potential of our refined model for automated glaucoma screening remains significant and offers exciting prospects for future enhancements.
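As one example of the class-weight balancing mentioned among these limitations, the following sketch applies inverse-frequency weights to a PyTorch cross-entropy loss using the training-set class counts reported above; the weighting scheme shown is a common choice and not necessarily the exact one used in the study.

```python
import torch
import torch.nn as nn

counts = torch.tensor([10869.0, 7498.0])          # [healthy, glaucoma] training counts
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weighting
loss_fn = nn.CrossEntropyLoss(weight=weights)     # rarer class contributes more

logits = torch.randn(8, 2)                        # placeholder model outputs
labels = torch.randint(0, 2, (8,))                # placeholder ground truth
print(loss_fn(logits, labels))
```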
This research used fundus images to develop a robust computer vision model for glaucoma screening. The best-performing model (vgg19_bn) achieved high values across multiple evaluation metrics for discerning disease status between glaucoma and healthy cohorts. The model offers a fast, cost-effective and highly accurate tool that can assist eye care practitioners in decision-making, ultimately improving patient outcomes and reducing the socioeconomic burden of glaucoma. However, the broad applicability of our model still needs to be validated on more ethnically diverse datasets to ensure its reliability and generalisability. Future studies should focus on collating images from people of diverse backgrounds.