Interpretable machine learning-derived nomogram model for early detection of diabetic retinopathy in type 2 diabetes mellitus: a widely targeted metabolomics study

Study participants determination and their characteristics

Using the propensity score matching (PSM) approach, a total of 69 matched blocks were obtained; the groups were comparable in clinical characteristics except for systolic blood pressure (SBP) and duration of diabetes. The mean age of the matched participants was 56.7 years (standard deviation, 9.2), and 61.4% were male. Among the 69 DR patients, 9 (13.0%), 31 (44.9%), 20 (29.0%), and 9 (13.0%) were classified as having mild NPDR, moderate NPDR, severe NPDR, and PDR, respectively. The demographic and clinical characteristics of the participants are given in Table 1.

Table 1 Clinical and demographic characteristics of the study population.

Data preprocessing and feature screening

The flowchart of data preprocessing and feature screening for the metabolomics data is shown in Fig. 1A. Of the 532 metabolites detected by the UPLC-MS/MS system, 483 features had a CV under 30% in the QC samples, and 449 of these had less than 20% missing values. Finally, 380 features were retained for the final data analyses after screening by the variance filtering and mutual information methods.
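The first two screening steps (QC-based CV filter and missing-value filter) can be illustrated with a minimal sketch. It assumes the thresholds stated above (CV under 30%, missing values under 20%); the metabolite names, the peak areas, and the function names are hypothetical, and the later variance-filtering and mutual-information steps are not covered.

```python
from statistics import mean, stdev

def qc_cv(values):
    """Coefficient of variation (SD / mean) of one metabolite across QC samples."""
    return stdev(values) / mean(values)

def missing_fraction(values):
    """Fraction of study samples in which the metabolite is missing (None)."""
    return sum(v is None for v in values) / len(values)

def screen_features(qc_data, sample_data, max_cv=0.30, max_missing=0.20):
    """Return names of metabolites passing both filters, in input order."""
    kept = []
    for name, qc_values in qc_data.items():
        if qc_cv(qc_values) >= max_cv:
            continue  # unstable in QC samples
        if missing_fraction(sample_data[name]) >= max_missing:
            continue  # too many missing values in study samples
        kept.append(name)
    return kept

# Hypothetical peak areas for three made-up metabolites.
qc_data = {
    "met_A": [100.0, 102.0, 98.0, 101.0],  # stable in QC
    "met_B": [100.0, 180.0, 40.0, 90.0],   # unstable in QC
    "met_C": [50.0, 51.0, 49.0, 50.0],     # stable in QC
}
sample_data = {
    "met_A": [95.0, None, 110.0, 105.0, 99.0, 101.0],  # 1 of 6 missing
    "met_B": [90.0, 85.0, 120.0, 100.0, 95.0, 88.0],   # none missing
    "met_C": [None, None, 52.0, 48.0, 50.0, 51.0],     # 2 of 6 missing
}
kept = screen_features(qc_data, sample_data)
print(kept)
```

In this toy input only met_A survives: met_B fails the CV filter and met_C fails the missing-value filter.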

Construction and evaluation of machine learning model

Table S1 shows the classification results of the parameter-optimized models built with the scikit-learn (sklearn) package in Python. RF and XGBoost, which are typical ensemble algorithms under the Bagging and Boosting frameworks, respectively, had excellent classification capabilities: accuracy, precision, recall, and F1-score were all above 95%. NNs also had excellent classification performance. As typical representatives of interpretable machine learning, DT and LR performed close to RF, XGBoost, and NNs, with each evaluation index higher than 90%. KNN, GNB, and SVM were slightly inferior.

The 10-fold cross-validation accuracies of the eight machine learning models were compared using the Friedman test. As shown in Fig. 1B, the accuracy differed among the models (P < 0.05). The Nemenyi test was then used for pairwise comparison of the accuracy of the eight models. The heat map of the model comparison shows that the performance of KNN and GNB was inferior to that of RF, XGBoost, and NNs (P < 0.05). There was no significant difference among the other models (P > 0.05).
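As an illustration of the Friedman comparison, the sketch below computes the Friedman chi-square statistic from a table of per-fold accuracies in plain Python. The accuracies are hypothetical (4 folds, 3 models, rather than the 10 folds and 8 models of the study), no tie correction is applied, and the P value (obtained from a chi-square distribution with k - 1 degrees of freedom) and the Nemenyi post-hoc step are not computed here.

```python
def average_ranks(row):
    """Rank the values of one fold (1 = highest accuracy), averaging tied ranks."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def friedman_statistic(table):
    """Friedman chi-square (no tie correction) for n folds x k models."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    mean_ranks = [s / n for s in rank_sums]
    return chi2, mean_ranks

# Hypothetical accuracies: rows are folds, columns are three models.
accuracies = [
    [0.96, 0.93, 0.88],
    [0.97, 0.94, 0.90],
    [0.95, 0.92, 0.89],
    [0.98, 0.95, 0.87],
]
chi2, mean_ranks = friedman_statistic(accuracies)
print(chi2, mean_ranks)
```

Because the first model is best in every fold and the third is worst in every fold, the statistic reaches its maximum for this layout, n(k - 1) = 8.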

In terms of interpretability, DT and LR have inherent advantages, and in this study the performance of these two models was not inferior to that of the other models. Compared with LR, the DT model was more concise. Therefore, DT was selected for further analysis on the basis of its performance, interpretability, and simplicity.

Constructing a decision tree and verification of prediction accuracy

The parameter tuning strategy strongly influences a DT, and choosing it well is central to optimizing the decision tree algorithm. First, we used a hyper-parameter learning curve to determine the maximum depth of the tree. As shown in Fig. 1(C), the model had the highest accuracy when max_depth = 2. We then used grid search to determine the remaining optimal parameters of the tree model (criterion = ‘gini’, min_samples_leaf = 1, min_samples_split = 2).
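To make the ‘gini’ criterion concrete, the following sketch shows how a single CART-style split can be chosen for one feature by minimizing the weighted Gini impurity of the two child nodes. It is a simplified illustration, not the study's pipeline: the feature values and labels are hypothetical, candidate thresholds are placed at midpoints between neighbouring values (a common convention, assumed here), and max_depth, min_samples_leaf, and min_samples_split are not modelled.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split.

    Returns (None, parent_gini) if no split lowers the impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical feature values and class labels.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["DR", "DR", "DR", "DM", "DM", "DM"]
threshold, score = best_split(values, labels)
print(threshold, score)
```

Here the two classes are perfectly separable, so the chosen threshold lies midway between 3.0 and 10.0 and both child nodes are pure.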

As shown in Fig. 1(D), the root node of the DT was 2-pyrrolidinone. Participants with a 2-pyrrolidinone peak area over 9,910,000.0 were assigned to the healthy control group. For the branch with 2-pyrrolidinone under 9,910,000.0, the second node was thiamine triphosphate (ThTP): participants with a ThTP peak area under 24350.0 were classified as DR patients; otherwise, they were classified as DM patients.
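The two-level rule described above can be written as a small function. This is only a sketch of the rule as stated in the text; the function and constant names are hypothetical, and the handling of values exactly equal to a cutoff is an assumption, since the text specifies only "over" and "under".

```python
PYRROLIDINONE_CUTOFF = 9_910_000.0  # 2-pyrrolidinone peak area, as stated in the text
THTP_CUTOFF = 24_350.0              # ThTP peak area, as stated in the text

def classify(pyrrolidinone_area, thtp_area):
    """Apply the depth-2 decision rule to one participant's peak areas.

    Assumption: a value exactly equal to a cutoff follows the 'otherwise' branch.
    """
    if pyrrolidinone_area > PYRROLIDINONE_CUTOFF:
        return "control"  # healthy control group
    if thtp_area < THTP_CUTOFF:
        return "DR"       # diabetic retinopathy
    return "DM"           # diabetes without retinopathy

# Hypothetical peak areas for three participants.
print(classify(12_000_000.0, 30_000.0))
print(classify(5_000_000.0, 10_000.0))
print(classify(5_000_000.0, 40_000.0))
```

Note that ThTP is consulted only on the low 2-pyrrolidinone branch, so it separates DR from DM among participants who were not assigned to the control group.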

Application of the CART DT yielded good discrimination of DR in the training set (accuracy, 94.6%). To evaluate the generalization ability of this DT, we assessed it with both the hold-out method and 10-fold cross-validation (Table S2). The accuracy of the DT was 93.3% with the hold-out method and 94.3% with cross-validation. Precision, recall, and F1-score were also higher than 90%.

Combination of clinical and metabolic biomarkers for DR recognition

Identifying DR cases among T2DM patients efficiently was the major objective of this study. As shown in Fig. 1D, ThTP achieved this goal well, and the correlation analysis further supported this result (Table S3). After adjusting for SBP and the duration of diabetes, the association between ThTP and DR was significant. With each standard deviation (SD) increase in ThTP, the odds of DR were reduced by 100% [OR: 0.00, 95% CI (0.00, 0.03); P < 0.001]. According to the cutoff point found by the DT model, the odds of DR in people with a ThTP level below 24350 were 311.32 times those in people whose serum ThTP was above or equal to 24350 [OR: 311.32, 95% CI (32.75, 2959.78); P < 0.001]. In the multivariable analysis, the odds of DR increased by 23% for each additional year of disease duration [OR: 1.23, 95% CI (1.03, 1.48); P = 0.023], and by 228% for each additional SD [OR: 3.28, 95% CI (1.05, 10.27); P = 0.042]. According to the cutoff point found by the cubic spline curve (Fig. S2), the odds of DR in people with a disease course longer than 10 years were 22.95 times those in people with a disease course shorter than 10 years [OR: 22.95, 95% CI (1.73, 304.65); P = 0.018].
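The percentages quoted above follow from the odds ratios by the usual conversion, percent change = (OR - 1) x 100. Strictly, this is a change in odds rather than in probability. The short sketch below reproduces the three figures; the function name is hypothetical.

```python
def percent_change_in_odds(odds_ratio):
    """Convert an odds ratio into a percent change in odds per unit increase."""
    return (odds_ratio - 1.0) * 100.0

# Odds ratios reported in the text.
per_year_duration = percent_change_in_odds(1.23)  # per additional year
per_sd = percent_change_in_odds(3.28)             # per additional SD
per_sd_thtp = percent_change_in_odds(0.00)        # per SD increase in ThTP

print(round(per_year_duration, 1), round(per_sd, 1), round(per_sd_thtp, 1))
```

An OR reported as 0.00 is a rounded value, so the corresponding "100% reduction" should be read as "close to 100%" rather than as an exact figure.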

Although SBP did not reach statistical significance in the multivariable analysis, we still included it in the model because of its clinical importance. We then combined the above 3 biomarkers to develop a screening model in the training set and displayed it as a nomogram (Fig. 2A). The calibration curves of the nomogram for predicting DR risk in T2DM patients showed good agreement, with non-significant Hosmer–Lemeshow Chi-square values of 2.68 (P = 0.953) and 3.99 (P = 0.858) in the training and testing sets, respectively (Fig. S3). These results show that the model had good consistency.

Fig. 2: Development and validation of the nomogram model.

Developed nomogram for diabetic retinopathy (A), and the ROC curves and decision curve analysis curves of the Nomogram model, Rhee et al. model, Aspelund et al. model, Hippisley-Cox and Coupland model, and Dagliati et al. model in the training set (B, C) and testing set (D, E). Notes: nomogram model, thiamine triphosphate, systolic blood pressure, duration of diabetes; Rhee et al. model, glutamine/glutamate ratio; Aspelund et al. model, gender, systolic blood pressure, duration of diabetes, and glycated hemoglobin; Hippisley-Cox and Coupland model, age, BMI, systolic blood pressure, cholesterol/high-density lipoprotein ratio, glycated hemoglobin; Dagliati et al. model, age, gender, duration of diabetes, BMI, glycated hemoglobin, hypertension, smoking; none, net benefit when all patients are considered as not having the outcome (diabetic retinopathy); all, net benefit when all patients are considered as having the outcome. The preferred model is the model with the highest net benefit at any given threshold. Abbreviations: MEDN430 thiamine triphosphate, sBp systolic blood pressure, DM_duration duration of diabetes.

Several models for early identification or diagnosis of DR have been reported [20,21,22,23]. The discriminative ability of each model was assessed by AUC (Table 2, Fig. 2). The AUCs of the nomogram in both the training set (AUC, 0.989; 95% CI, 0.974–1.000) and the testing set (AUC, 0.985; 95% CI, 0.954–1.000) were significantly higher than those of the previous models (P < 0.05). In addition, the cutoff value of the total points of the model in the training set was 79.11, according to which DM patients could be divided into high-risk and low-risk groups. At this cutoff, the sensitivity was 97.96%, the specificity was 93.88%, the accuracy was 95.92%, the positive predictive value was 94.12%, the negative predictive value was 97.87%, and the Youden index was 0.92 (Table 2). The model retained excellent classification ability in the testing set, where the sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and Youden index were 95.00%, 100.00%, 97.50%, 100.00%, 95.24%, and 0.95, respectively (Table 2). As shown in Fig. 2, in both the training and testing sets, the nomogram model outperformed the comparison models regardless of the threshold, indicating the greatest net clinical benefit.
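The screening metrics above are all functions of a 2 x 2 confusion matrix. The sketch below shows the definitions and checks them against the reported values. The cell counts are not given in the text; they are back-calculated here as one set of counts consistent with the reported percentages (an assumption), and the function name is hypothetical.

```python
def screening_metrics(tp, fn, tn, fp):
    """Standard screening metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
        "youden": sensitivity + specificity - 1.0,
    }

# Assumed counts, inferred from the reported percentages (not stated in the text).
train = screening_metrics(tp=48, fn=1, tn=46, fp=3)
test_set = screening_metrics(tp=19, fn=1, tn=20, fp=0)

for name, result in (("training", train), ("testing", test_set)):
    percents = {k: round(v * 100, 2) for k, v in result.items() if k != "youden"}
    print(name, percents, "Youden:", round(result["youden"], 2))
```

With these counts the function reproduces the reported training-set values (97.96%, 93.88%, 95.92%, 94.12%, 97.87%, Youden 0.92) and testing-set values (95.00%, 100.00%, 97.50%, 100.00%, 95.24%, Youden 0.95).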

Table 2 Comparison of the predictive ability of the Nomogram model and models constructed in previous studies.

In particular, we selected DR patients of different severity from the DR group for sensitivity analysis. The nomogram model established above retained an excellent ability to distinguish diabetic participants without DR from patients with mild DR, with AUCs of 0.997 (95% CI, 0.987–1.000) and 1.000 (95% CI, 1.000–1.000) in the training and testing sets, respectively (Fig. S4). Similarly, the AUCs for moderate DR were 1.000 (95% CI, 1.000–1.000) and 0.964 (95% CI, 0.889–1.000), and those for severe DR were 0.968 (95% CI, 0.925–1.000) and 1.000 (95% CI, 1.000–1.000).
