Development and validation of a LASSO prediction model for cisplatin-induced nephrotoxicity: a case-control study in China

This study used machine learning algorithms to construct a CIN prediction model based on clinical, laboratory, and genetic variables. Model construction strictly followed the statement for clinical prediction models and proceeded in three stages: developing the prediction model, validating the prediction model, and evaluating its predictive effectiveness [24]. The model demonstrated good sensitivity and specificity, indicating that combining laboratory and clinical variables can effectively identify populations at high risk of CIN. Although the model cannot be used as an independent diagnostic method, it can serve as a supplementary tool because its predictive factors are common, objective, and easily obtainable.

The candidate predictor set included 69 feature variables, 8 of which were genetic. If the genetic variables were coded as dummy variables, the total number of variables would increase to nearly 80. We therefore employed LASSO regression with a 1 standard deviation (1sd) penalty coefficient to consolidate the laboratory variables. This method effectively reduced the number of predictors and eliminated unimportant variables. LASSO is a shrinkage estimation method that performs model reduction. By imposing a penalty function on the coefficients, it shrinks the regression coefficients of all variables, and those of unimportant variables eventually shrink to zero. Compared with classical screening methods, LASSO can effectively avoid the influence of factors such as different orders of magnitude, different units, and possible collinearity between variables [25]. To screen candidate variables, we opted for LASSO regression over classic univariable regression, using a 1 standard deviation penalty coefficient lambda (λ) as the screening parameter to prevent the exclusion of relatively unimportant variables [7, 26, 27]. The LASSO algorithm was executed using the “glmnet” R package, while the logistic regression model was constructed using the “glm” function in R [20]. Subsequently, we employed multivariable stepwise logistic regression to identify a concise and effective set of variables, which were then fitted into the formula according to their respective weights. This standardized approach to variable selection and weight conversion helps mitigate differences in the same indicator arising from different laboratory methods [13, 28].
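The shrinkage behavior of LASSO described in this paragraph can be illustrated with a small Python sketch that fits an L1-penalized logistic regression by proximal gradient descent. This is not the authors' glmnet workflow in R: the synthetic data, the function names (`soft_threshold`, `fit_l1_logistic`), the step size, and the three penalty values are all assumptions chosen purely for illustration.

```python
import math
import random


def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: move z toward zero by t, stopping at zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=300):
    """Minimize mean logistic loss + lam * sum(|beta_j|); the intercept is not penalized."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            resid = sigmoid(intercept + sum(b * v for b, v in zip(beta, xi))) - yi
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * xi[j] / n
        intercept -= step * grad0
        # Gradient step on the smooth loss, then soft-thresholding for the L1 term.
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta


# Synthetic data (hypothetical): only the first two of five features affect the outcome.
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
y = [1 if random.random() < sigmoid(2.0 * x[0] - 1.5 * x[1]) else 0 for x in X]

# Fit with no penalty, a moderate penalty, and a very large penalty.
fits = {lam: fit_l1_logistic(X, y, lam)[1] for lam in (0.0, 0.05, 1.0)}
for lam, beta in fits.items():
    nonzero = sum(1 for b in beta if b != 0.0)
    print(f"lambda={lam}: {nonzero} non-zero coefficients", [round(b, 3) for b in beta])
```

With no penalty the coefficients are ordinary logistic regression estimates, whereas with a sufficiently large penalty every coefficient is driven exactly to zero. Intermediate values of λ retain only a subset of the predictors, which is the property used here for variable screening.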

In the training set, the genetic variable rs3212986 of ERCC1 showed statistically significant differences in allele frequency and genotype distribution between the CIN group and the control group. The A-allele frequency was higher in the CIN group (31.21%) than in the control group (24.92%). The proportions of the AA, CA, and CC genotypes were 11.64%, 39.15%, and 49.20% in the CIN group, and 12.03%, 25.64%, and 62.32% in the control group, respectively. These findings suggest that carriers of the A allele of rs3212986 are more likely to develop CIN, which is consistent with previous studies [29]. Similarly, the allele frequency and genotype distribution of rs920829 of TRPA1 also differed significantly between the CIN group and the control group. The T-allele frequency was lower in the CIN group (22.75%) than in the control group (28.69%). The proportions of the TT, CT, and CC genotypes were 8.46%, 28.57%, and 62.96% in the CIN group, and 16.96%, 23.47%, and 59.57% in the control group, respectively. These results suggest that T-allele carriers of rs920829 are less likely to develop CIN. However, when the variables were optimized through multivariable logistic regression, neither rs3212986 nor rs920829 was retained. It is possible that these variables lack independent predictive power or that their independent predictive value is not large enough [30].
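The relationship between the genotype proportions and the allele percentages quoted above can be checked with a short calculation: each homozygote contributes two copies of the allele and each heterozygote contributes one. The Python sketch below applies this to the reported genotype percentages; the function names are hypothetical, and the calculation works from the rounded percentages in the text, not from the underlying patient counts.

```python
def allele_frequency(hom_pct, het_pct):
    """Allele frequency (%) from genotype percentages: homozygotes count fully, heterozygotes half."""
    return hom_pct + het_pct / 2.0


def carrier_proportion(hom_pct, het_pct):
    """Percentage of individuals carrying at least one copy of the allele."""
    return hom_pct + het_pct


# Genotype percentages as reported in the text: (homozygous for the allele, heterozygous).
reported = {
    "rs3212986 A allele, CIN": (11.64, 39.15),
    "rs3212986 A allele, control": (12.03, 25.64),
    "rs920829 T allele, CIN": (8.46, 28.57),
    "rs920829 T allele, control": (16.96, 23.47),
}

frequencies = {name: allele_frequency(*pcts) for name, pcts in reported.items()}
carriers = {name: carrier_proportion(*pcts) for name, pcts in reported.items()}

for name in reported:
    print(f"{name}: allele frequency {frequencies[name]:.3f}%, carriers {carriers[name]:.2f}%")
```

Computed this way, three of the four values agree with the reported percentages to within rounding (31.215, 22.745, and 28.695). The rs3212986 control group gives 24.85 against the reported 24.92%, a small gap that cannot be resolved from the rounded percentages alone.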

Cystatin-C (Cys-C) was identified as the independent risk factor with the highest odds ratio (OR) in the prediction model, surpassing the other factors in predictive performance. The association between elevated Cys-C and a high risk of CIN may be explained as follows. First, Cys-C is produced by all nucleated cells in the body. Cys-C in the blood is filtered by the glomerulus and then reabsorbed and degraded in the renal tubules, without being secreted by them. This process makes it a more effective indicator of early glomerular filtration function than creatinine, urea nitrogen, and other indicators [31, 32]. Second, Cys-C is a member of the cysteine protease inhibitor family, and an imbalance between cathepsins and protease inhibitors may lead to tumor invasion and metastasis, which can also promote an elevation of Cys-C [33, 34]. Other factors in the model, such as DBIL and LDH, are neither traditional renal function indicators nor related to the cisplatin metabolism pathway, but they may reflect changes in physiological or pathological pathways (such as secretion and excretion, inflammatory response, oxidative stress damage, and electrolyte imbalance) during the occurrence and development of CIN [27]. Therefore, joint evaluation with an appropriately weighted model can aid in the earlier identification of CIN risk.

The model showed high sensitivity and negative predictive value (NPV), which can help identify patients at high risk of CIN and prompt clinicians to pay attention to the selection of the chemotherapy regimen and to drug compatibility and dosage. The results also showed satisfactory discrimination and a prediction curve close to the actual curve, indicating that the model's predictions are highly consistent with the observed outcomes when identifying cases at high risk of CIN. In the training set, the model had a C-index of 0.922 in the discrimination test, and the consistency test gave S: P = 0.790, Emax = 0.044, and Eave = 0.007, suggesting that both the discrimination and the consistency of the model were good. To guard against overfitting caused by random and systematic errors, a validation model was constructed from another prospective, independent data set. The fit of the model constructed from the test set data was consistent with that of the model constructed from the training set data. Clinical decision curve analysis further showed that the model had good clinical value when the high-risk threshold was between 0.1 and 0.9. Meanwhile, the precision-recall curve showed that, over the recall interval from 0.5 to 0.75, precision gradually declined as recall increased, up to 0.9.
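For readers less familiar with the metrics named in this paragraph, the following Python sketch shows how sensitivity, NPV, the C-index for a binary outcome, and the net benefit plotted in a decision curve are commonly computed. The ten outcomes and predicted probabilities are invented toy values, not study data, and the function names are hypothetical. The net benefit uses the standard definition TP/n − (FP/n) × pt/(1 − pt) at threshold probability pt, which is assumed here and not taken from the paper.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when a case is called high risk if probability >= threshold."""
    tp = fp = tn = fn = 0
    for outcome, prob in zip(y_true, y_prob):
        high_risk = prob >= threshold
        if high_risk and outcome == 1:
            tp += 1
        elif high_risk and outcome == 0:
            fp += 1
        elif not high_risk and outcome == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def c_index(y_true, y_prob):
    """Share of (case, control) pairs in which the case has the higher probability; ties count 0.5."""
    cases = [p for o, p in zip(y_true, y_prob) if o == 1]
    controls = [p for o, p in zip(y_true, y_prob) if o == 0]
    score = 0.0
    for pc in cases:
        for pn in controls:
            score += 1.0 if pc > pn else 0.5 if pc == pn else 0.0
    return score / (len(cases) * len(controls))


def net_benefit(tp, fp, n, threshold):
    """Decision-curve net benefit at one threshold probability."""
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Toy data (hypothetical): 4 patients with the outcome, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1, 0.1, 0.05]

tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold=0.5)
sensitivity = tp / (tp + fn)   # proportion of true cases flagged as high risk
npv = tn / (tn + fn)           # proportion of low-risk calls that are truly negative
concordance = c_index(y_true, y_prob)
benefit = net_benefit(tp, fp, len(y_true), threshold=0.5)

print(f"sensitivity={sensitivity:.3f}, NPV={npv:.3f}, "
      f"C-index={concordance:.3f}, net benefit={benefit:.3f}")
```

A decision curve repeats the net benefit calculation over a range of thresholds and compares the model with the strategies of treating all patients or none as high risk.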

The prediction model developed in this study has certain limitations. First, it is a single-center study, and although the test set data were prospectively included, the training set data were obtained retrospectively from the electronic medical record system. Consequently, factors such as missing data were unavoidable, leaving a final training set of 696 patients, which may limit the model’s generalizability and necessitates further multicenter research and external validation. Second, the study did not incorporate the latest CIN-related biomarkers, such as malondialdehyde (MDA), NADPH oxidases (NOX), or heme oxygenase 1 (HO-1), which could potentially affect the results [2]. Future research should progressively conduct validation studies across multiple centers to refine and enhance the model and to provide guidance for clinical practice.
