Machine learning algorithms identify demographics, dietary features, and blood biomarkers associated with stroke records

5. Discussion

This study compared the ability of NHANES-derived dietary nutrients, blood biomarkers, clinical features, and the combination of these three data domains to classify individuals with or without a prior stroke using three ML models. We found that dietary nutrient intake contributed the least to performance, followed by blood biomarker data. Models based on clinical features showed no difference in performance compared to those based on a combination of all three data domains. We then extracted the ten most powerful features, and our external validation study showed that their performance generalized. Subsequently, we summarized the profiles of the training samples, provided risk stratification for individuals, and developed a nomogram to facilitate the manual classification of stroke records. One main objective of this study was to provide highly accurate predictive models based on far fewer and more informative features without compromising performance relative to the ML model using all features. This set of variables, which are also relatively easy to collect, can be used by physicians or health authorities for a quick preliminary assessment of the potential stroke status of high-risk patients. Since some of the variables have not been thoroughly investigated as risk factors for stroke (e.g., asthma [Corlateanu A. Stratan I. Covantev S. Botnaru V. Corlateanu O. Siafakas N. Asthma and stroke: a narrative review.] and snoring [Li J. McEvoy R.D. Zheng D. Loffler K.A. Wang X. Redline S. Woodman R.J. Anderson C.S. Self-reported snoring patterns predict stroke events in high-risk patients with obstructive sleep apnea: post-hoc analyses of the SAVE study.]), these SHAP features could also inspire future large-scale longitudinal studies or experiments to validate potential causality or reveal the underlying mechanisms.

We referred to other NHANES studies and compared the diet and blood features.
However, features related to stroke records have seldom been explored in NHANES. In an NHANES cardiovascular disease (CVD) and diabetes study [Dinh A. Miertschin S. Young A. Mohanty S.D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning.], we likewise found that the top ten important features were almost all clinical features rather than diet or blood features; the top five were all clinical features. Thus, we did not identify any diet or blood biomarkers that contributed substantially to stroke records, although such features can have high predictive power for mortality. On the other hand, different forms of nutrient representation, e.g., nutrition indices, might be helpful. In a study on NHANES data, dietary features were shown to have a strong influence on stroke-related mortality [Micha R. Peñalvo J.L.L. Cudhea F. Imamura F. Rehm C.D.D. Mozaffarian D. Association between dietary factors and mortality from heart disease, stroke, and type 2 diabetes in the United States.], which contradicts our findings, where diet had lower significance. However, that study also included broad cardiometabolic food categories such as fruits, vegetables, and unprocessed meats in its dietary intake analysis, while the current study only utilized dietary nutrient and supplement data. Another study examined whether CVD mortality prediction could benefit from nutrition data through ML algorithms; it involved all nutrition variables, including micronutrients (e.g., sodium and selenium), macronutrients (e.g., fat, carbohydrates, and protein), and commonly used composite nutrition indices (e.g., the Alternate Healthy Eating Index, the Mediterranean Diet Score, and the Dietary Approaches to Stop Hypertension diet score) [Machine learning with sparse nutrition data to improve cardiovascular mortality risk prediction in the USA using nationally randomly sampled data.].
The investigators revealed that micronutrients and macronutrients, rather than nutrition indices, improved the predictive capacity of ML-based models. However, when conventional Cox modeling was adopted, such nutrient information was found to contribute little to stroke prediction. In light of these inconsistent findings, we postulate that incorporating macronutrients and composite nutrition indices in the dietary dataset could help reexamine the importance of diet in stroke prevalence. On the other hand, in our H2O model, protein and vitamin B6 ranked fourth and ninth among its top ten features. Therefore, different forms of nutrients and their transformations in different models warrant further investigation.

Agreement between ML model performances based on laboratory- or non-laboratory-based information has been observed in previous studies that utilized NHANES data for CVD mortality prediction [Pandya A. Weinstein M.C. Gaziano T.A. A comparative assessment of non-laboratory-based versus commonly used laboratory-based cardiovascular disease risk scores in the NHANES III population.,Gaziano T.A. Young C.R. Fitzmaurice G. Atwood S. Gaziano J.M. Laboratory-based versus non-laboratory-based method for assessment of cardiovascular disease risk: the NHANES I follow-up study cohort.]. In those studies, the non-laboratory data mainly included clinical features plus dietary nutrients, while the laboratory data included clinical features plus blood/urine biomarkers; the two therefore overlapped in clinical features. Specifically, a recent study [Dinh A. Miertschin S. Young A. Mohanty S.D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning.] on diabetes mellitus and CVDs using NHANES compared the non-laboratory with the laboratory dataset.
The investigators noted a significant influence on CVD prediction from their non-laboratory variable-based model, while the laboratory-based model did not enhance performance. This was similar to our finding that the addition of blood biomarker and diet data did not significantly improve performance when comparing the union-based with the clinical-dataset-based models. As an additional comparison, we added the clinical set to both the diet and the blood sets and found that the AUROC and AUCPR did not differ (results not shown), suggesting that the overlapping non-laboratory variables might account for the non-differential performances. Moreover, in that recent study [Dinh A. Miertschin S. Young A. Mohanty S.D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning.], the AUROC values were high (0.816–0.839). Two reasons may explain this. The first is data leakage in the dataset preprocessing step: normalization was performed before the train-test split, which could yield unreliably better test performance because the training and test sets shared the same prior information (mean and standard deviation) from the original data. The second underlying cause is its moderately imbalanced CVD data (17% vs. our 3%). Classifiers can benefit more from moderately imbalanced data than from severely imbalanced data [Veganzones D. Séverin E. An investigation of bankruptcy prediction in imbalanced datasets.]. In addition, that study downsampled both the training and the test sets, which modified the test structure and differed from our approach.

H2O Driverless AI is rarely used in cardiovascular disease research. To the best of our knowledge, this is the first time this framework has been applied to stroke.
Compared with other preventive studies on CVDs using automatic ML frameworks such as AutoPrognosis (AUROC = 0.78), Auto-sklearn (AUROC = 0.76), Auto-WEKA (AUROC = 0.75), and TPOT (AUROC = 0.74) [Alaa A.M. Bolton T. Di Angelantonio E. Rudd J.H.F.F. van der Schaar M. Cardiovascular disease risk prediction using automated machine learning: a prospective study of 423,604 UK Biobank participants.], our H2O model produced a higher AUROC (0.804 for the union set, 0.817 for the SHAP set, and 0.832 for the external validation), suggesting it is an effective classifier.

As the data were cross-sectional, explaining which variables suggest a prior stroke is more challenging. In our global explanation (explanations for the sampled population), the ten most influential features (the SHAP set) were broadly consistent with the literature, while some other features (such as sedentary activity, diastolic blood pressure (DBP), and weight) need further exploration to understand how effective they are in assessing prior stroke in individuals. We reviewed further studies for other possible explanations. Take DBP as an example: low-normal DBP levels after a stroke have been linked to the risk of recurrent vascular events [Post-stroke diastolic blood pressure and risk of recurrent vascular events.]. On the other hand, among the several available SHAP algorithms, Deep SHAP was selected for two reasons. First, we did not manually explore feature interactions in feature engineering; feature explanation with Deep SHAP can still incorporate interaction information because the DNN model crosses features in its hidden layers. In addition, when we calculated the SHAP values using models based on other algorithms, such as a support vector machine (SVM) with the Kernel SHAP algorithm and CatBoost with Tree SHAP [A unified approach to interpreting model predictions, Adv. Neural Inf.,Huang G. Wu L. Ma X.
Zhang W. Fan J. Yu X. Zeng W. Zhou H. Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions.], the performances of these models were lower than that of our DNN. Moreover, we found that some variables with less SHAP importance than our SHAP set features had their interpretations change across models (especially the DNN); for example, sedentary activity was a positively correlated feature in CatBoost, and BMI was a negative feature in another trained DNN model, while these two features had the opposite interpretations in our DNN model. This was possibly due to fluctuations in DNN performance for variables with small SHAP values. However, we could avoid such changes in interpretation by focusing on the most important features, i.e., the top ten feature explanations. Consequently, in this study, we did not explore explanations beyond the top ten features.

Due to the cross-sectional nature of the NHANES database, our study cannot be used for prospective risk assessment. This also restricted comparison between our study and the traditional risk scores/Cox regression used in longitudinal studies to assess the years during which the analyzed individuals were at risk. However, some NHANES studies provided follow-up outcomes for CVDs, though few focused solely on stroke [Machine learning with sparse nutrition data to improve cardiovascular mortality risk prediction in the USA using nationally randomly sampled data.,Pandya A. Weinstein M.C. Gaziano T.A. A comparative assessment of non-laboratory-based versus commonly used laboratory-based cardiovascular disease risk scores in the NHANES III population.,Gaziano T.A. Young C.R. Fitzmaurice G. Atwood S. Gaziano J.M. Laboratory-based versus non-laboratory-based method for assessment of cardiovascular disease risk: the NHANES I follow-up study cohort.].
In a cross-sectional NHANES study aimed at predicting CVD risk, which was most similar to our data structure, ML, rather than the Cox model, benefited from the nutrition data and achieved higher performance [Machine learning with sparse nutrition data to improve cardiovascular mortality risk prediction in the USA using nationally randomly sampled data.]. Therefore, for our data structure, ML may still have the potential to capture the complexity of nonlinear relationships. We speculate that ML could outperform Cox models or traditional risk scores if follow-up data become available for stroke outcomes.

Our study has limitations that need to be addressed. First, this dataset is cross-sectional in nature, so the detected associations might be less robust than those from study designs based on prospectively collected data. Although cross-sectional studies can provide information on the prevalence of a particular disease, which is helpful in planning public health interventions [Principles of tumors: a translational approach to foundations.], our results cannot be used to infer causality or guide treatment decisions. However, as we did not infer causality between the outcome and the exposures, there should be no temporal bias.

Second, our outcome was self-reported in the questionnaire, which might suffer from selection bias or reporting bias. For selection bias, NHANES achieves national representativeness through complex sampling and manual quality control, which mitigates selection bias; we also incorporated the sampling weights into the modeling to adjust the effect sizes to the whole population and to evaluate selection bias. For the reporting bias assessment, we adopted UKB hospital inpatient data to correct the labels.

Third, the proportions of observations and features deleted are a concern. We excluded: 1) observations with non-valid (‘Refused to answer’ or ‘Don't know’) responses in covariates; 2) observations with unlabeled (missing values and non-valid responses in outcomes) data; 3) in NHANES 2015–16 for model development, features with over 30% missing values. We infer that the resulting biases were small. First, the non-valid answers were outside the scope of our analysis: we aimed to explore features whose definitions are clear to medical practitioners/caregivers, and excluding non-valid answers might magnify the effect of valid answers. Additionally, the total proportions of these answers were small, ranging from 14%–18% of the raw data for NHANES and 23% for UKB, so the bias from deleting non-valid answers in covariates appeared small. Then, for the unlabeled data, the sensitivity analysis suggested low bias, with only a 0.02 reduction in AUROC when the sample size was more than doubled in NHANES 2015–16. Finally, for feature deletion, we mainly deleted biomarkers with over 30% missing values (including selenium, cadmium, and LDL cholesterol); these tended to be collected only in subgroups and thus had small sample sizes. Still, we had large numbers of features in each dataset (30 to 122) in model development, and some of them could serve as proxy variables for the deleted ones (e.g., LDL cholesterol can be approximately derived from total cholesterol and HDL cholesterol), thereby reducing the bias from the information lost with the deleted features.
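The 30% missingness rule above can be sketched in a few lines of pandas; the column names here are illustrative stand-ins, not actual NHANES variable codes.

```python
import numpy as np
import pandas as pd

def filter_missing(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Drop features whose fraction of missing values exceeds max_missing,
    mirroring the 30% threshold used for NHANES 2015-16 model development."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep]

# Toy frame: 'ldl' is 40% missing and gets dropped; 'total_chol' is kept.
df = pd.DataFrame({
    "total_chol": [180, 200, 210, 190, 220],
    "ldl": [100, np.nan, np.nan, 110, 120],  # 2/5 = 40% missing
})
filtered = filter_missing(df)
```

The threshold is a parameter, so the same helper can be reused to probe the sensitivity of the feature set to the cutoff.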

Fourth, and related to feature deletion, many important established blood biomarkers were not included in our modeling. NHANES lacked some important stroke-associated variables, such as lipoprotein-associated phospholipase A2, D-dimer, and interleukin 6, whose inclusion might further improve performance. However, based on the results of the blood biomarkers, we deduce that these missing laboratory biomarkers might still be outperformed by the SHAP set in our modeling. Moreover, we prioritized a few informative and easily collected features to improve the simplicity and utility of our nomogram and the generalizability of the model.

Fifth, the NHANES data lack information on some other features. For example, arthritis history had 5708 observations, while the type of arthritis had only 1433; thyroid history had 5806, while current thyroid status had only 593. Thus, more informative features such as arthritis type and current thyroid status had small sample sizes and were deleted because of too many missing values, which may also have biased our results.

Despite these limitations, our results, based on a few questionnaire-based clinical variables, show high and robust performance and can potentially provide generalized and individualized characteristics of stroke survivors for an updated estimate of stroke prevalence.

Appendix A. Data and methodologies

Data

Clinical and second clinical set variables from demographic, examination, and questionnaire data

The lists for the clinical set and the second clinical set included: 1) influential factors: high blood pressure, smoking, diabetes, physical inactivity, obesity, high blood cholesterol, heart diseases, sickle cell disease, age, race, gender, income, alcohol, drug abuse, sleep habits, oral health, gout, asthma, angina, thyroid, cancer, and hepatitis; 2) symptoms: face or limb weakness/numbness, slurred speech (confusion), trouble seeing, trouble walking, and severe headache; 3) complications: urinary tract infection and/or bladder control, pneumonia, swallowing problems, clinical depression, shoulder pain/anxiety, breathing problems, and aspirin use. Moreover, possible confounders, including diabetes risk and taking insulin or medication for depression, anxiety, and cholesterol, were also added. The clinical set consisted of the influential factors, while the second clinical set was the combination of all the above variables.

Blood biomarkers from laboratory data

A list of blood biomarkers, covering metabolic, inflammatory, oxidative stress, and neurohormone markers, was compiled. It included: 1) metabolic: calcium, iron, cadmium, chloride, total cholesterol, triglyceride, percentage of segmented neutrophils, red cell distribution width, glycohemoglobin, potassium, sodium, high-density lipoprotein cholesterol (HDL C), folic acid, and glucose; 2) inflammatory: white blood cell count, hematocrit, platelet count, aspartate and alanine aminotransferase, gamma-glutamyl transferase, lactate dehydrogenase, creatinine, high-sensitivity C-reactive protein, monocyte/HDL-C ratio, and hemoglobin; 3) oxidative stress: segmented neutrophil/lymphocyte ratio, total bilirubin, and uric acid; 4) neurohormone: cotinine.
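As a minimal illustration, the composite markers above (the monocyte/HDL-C ratio and the segmented neutrophil/lymphocyte ratio) can be derived from raw laboratory columns; the column names below are hypothetical stand-ins for the NHANES variable codes.

```python
import pandas as pd

def add_ratio_biomarkers(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the composite ratio markers listed above from raw lab columns.
    Column names are illustrative, not actual NHANES variable codes."""
    out = df.copy()
    out["monocyte_hdl_ratio"] = out["monocyte_count"] / out["hdl_c"]
    out["neutrophil_lymphocyte_ratio"] = (
        out["seg_neutrophil_pct"] / out["lymphocyte_pct"]
    )
    return out

labs = pd.DataFrame({
    "monocyte_count": [0.5, 0.6],
    "hdl_c": [50.0, 40.0],
    "seg_neutrophil_pct": [60.0, 55.0],
    "lymphocyte_pct": [30.0, 25.0],
})
labs = add_ratio_biomarkers(labs)
```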

Dietary nutrients from dietary data

The American Heart Association Diet and Lifestyle Recommendations encourage eating a variety of nutritious foods from all the food groups and eating fewer nutrient-poor foods to fight cardiovascular disease. In our work, we included all the nutrients from foods or beverages in the dietary intake data for stroke classification; dietary supplements were also counted. These nutrients included: 1) nutrients that offer the most calories: carbohydrates, sugars, total fats, HDL C, protein, fiber, and saturated and unsaturated (monounsaturated and polyunsaturated) fatty acids; 2) vitamins: vitamin A/B1/B2/B6/B12/C/D/E/K, alpha-carotene, beta-carotene, lycopene, lutein, riboflavin, niacin, folic acid, B-cryptoxanthin, theobromine, and folate; 3) minerals: sodium, phosphorus, zinc, potassium, magnesium, iron, copper, and selenium; 4) other nutrients: water, alcohol, and caffeine.

Variable definitions and extraction

After the variables mentioned above were extracted from the different files of the NHANES 2015–16 database, they were merged according to the unique ID ‘SEQN’ to generate the four datasets. The occurrence of stroke was determined by the subject's answer to the question ‘Has a doctor or other health professional ever told that . . . had a stroke?’. Of 9575 individuals, 5714 answered either ‘Yes’ (209) or ‘No’ (5505), 3856 had missing values, and five individuals refused to answer or answered ‘Don't know’. Other features were merged, removed if over 30% of their values were missing, or replaced by similar proxy variables where available. For example, smoking was defined by the response to the question ‘Smoked at least 100 cigarettes in life?’ rather than ‘Do you now smoke cigarettes?’ because the latter produced over 30% missing values. Observations with ‘Don't know’ or ‘Refused to answer’ responses for stroke were also deleted, but they were counted in the semi-supervised model. Systolic blood pressure (SBP) and diastolic blood pressure (DBP) in the examination data were obtained from three consecutive readings taken after the subject had rested for five minutes in a seated position and after the maximum inflation level was determined. A fourth reading was taken if a BP measurement was interrupted or incomplete. We then averaged the three readings to obtain the SBP and DBP values. Urine biomarkers in the ‘Laboratory Data’ were excluded because of the small sample size and the few variables remaining after deleting those with over 30% missing values, but dietary supplements from the ‘Dietary Data’ were considered and added to the total nutrient intakes. Moreover, ‘Added alpha-tocopherol (Vitamin E) (mg)’ and ‘Added vitamin B12 (mcg)’ were added to ‘Vitamin E as alpha-tocopherol (mg)’ and ‘Vitamin B12 (mcg)’, respectively. The final dietary values were averaged from the first- and second-day records.
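A minimal pandas sketch of this extraction pipeline, merging files on ‘SEQN’ and averaging the three BP readings and the two dietary recall days; the mini-tables and column names are illustrative, not real NHANES files.

```python
import pandas as pd

# Hypothetical mini-tables standing in for separate NHANES files; the real
# files share the unique respondent ID 'SEQN' but use NHANES variable codes.
demo = pd.DataFrame({"SEQN": [1, 2], "age": [60, 45]})
bpx = pd.DataFrame({
    "SEQN": [1, 2],
    "sbp1": [130, 118], "sbp2": [128, 120], "sbp3": [132, 122],
    "dbp1": [80, 70], "dbp2": [82, 72], "dbp3": [78, 74],
})
diet = pd.DataFrame({"SEQN": [1, 2], "prot_d1": [70, 60], "prot_d2": [80, 66]})

# Merge the files on the shared ID, as done to build the four datasets.
merged = demo.merge(bpx, on="SEQN").merge(diet, on="SEQN")

# Average the three consecutive BP readings and the two dietary recall days.
merged["SBP"] = merged[["sbp1", "sbp2", "sbp3"]].mean(axis=1)
merged["DBP"] = merged[["dbp1", "dbp2", "dbp3"]].mean(axis=1)
merged["protein"] = merged[["prot_d1", "prot_d2"]].mean(axis=1)
```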

Methodologies

Datawig imputation

DataWig imputation, which can be used for numerical, categorical, and unstructured text data [Bießmann F. Rukat T. Schmidt P. Naidu P. Schelter S. Taptunov A. Lange D. Salinas D. DataWig: missing value imputation for tables.], was adopted in this study. Inspired by established approaches [Flexible Imputation of Missing Data.], DataWig follows the process of multivariate imputation by chained equations (MICE) [van Buuren S. Groothuis-Oudshoorn K. Mice: multivariate imputation by chained equations in R.]. First, string and character-sequence features are converted into numeric representations and then transformed into embeddings via long short-term memory (LSTM) networks [Sundermeyer M. Schlüter R. Ney H. LSTM neural networks for language modeling.] or n-gram hashing [Cheng W. Greaves C. Warren M. From n-gram to skipgram to concgram.]; numeric features are transformed into embeddings directly. All embeddings are then concatenated and finally fitted with a regression or cross-entropy loss according to the type of the missing value. DataWig compares favorably with other implementations (mean [Young W. Weckman G. Holland W. A survey of methodologies for the treatment of missing values within datasets: limitations and benefits.], k-nearest neighbor (KNN) [Nearest neighbor selection for iteratively kNN imputation.], matrix factorization [Koren Y. Bell R. Volinsky C. Matrix factorization techniques for recommender systems.], MissForest [Flexible Imputation of Missing Data.], MICE [van Buuren S. Groothuis-Oudshoorn K. Mice: multivariate imputation by chained equations in R.]) for numeric and unstructured text imputation, even in the complex condition of missing-not-at-random data [Bießmann F. Rukat T. Schmidt P. Naidu P. Schelter S. Taptunov A. Lange D. Salinas D. DataWig: missing value imputation for tables.].
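As a stripped-down illustration of the chained-equations process that DataWig builds on, here is a toy numeric MICE loop; this is a sketch of the general idea, not DataWig's actual implementation or API.

```python
import numpy as np

def mice_numeric(X: np.ndarray, n_iter: int = 10) -> np.ndarray:
    """Minimal chained-equations imputation for a numeric matrix: start from
    column means, then repeatedly regress each column with missing entries on
    all the others and refill its missing cells. A toy illustration of MICE."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])  # initial mean fill
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = miss[:, j]
            if not rows.any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([others, np.ones(len(X))])  # add an intercept
            obs = ~rows
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[rows, j] = A[rows] @ coef  # refill missing cells with predictions
    return X

# Two perfectly correlated columns (y = 2x): the chained regression recovers
# the missing entry almost exactly.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
X_imp = mice_numeric(X)
```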

Feature selection by BoostARoota

We used BoostARoota [] to filter out redundant features and select important ones. BoostARoota, a modified version of the Boruta algorithm [Feature selection with the Boruta package.], is a wrapper feature selection algorithm. Compared to Boruta, BoostARoota uses XGBoost [XGBoost: a scalable tree boosting system.] as the base model and modifies the feature elimination process, making it computationally faster than Boruta. We repeated BoostARoota thirty times (changing its random seed) and retained the features selected in every run as the robust set for further analysis.
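The shadow-feature idea behind Boruta and BoostARoota can be sketched as follows; here the importance measure is a simple absolute correlation, a stand-in for the XGBoost gain that the real BoostARoota uses.

```python
import numpy as np

def shadow_feature_select(X, y, n_rounds=5, seed=0):
    """Boruta/BoostARoota-style selection sketch: append shuffled 'shadow'
    copies of every feature, score all columns, and keep only the real
    features that beat the best shadow score in every round. Importance here
    is |corr(x, y)|, a simple stand-in for XGBoost gain."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    keep = np.ones(n_feat, dtype=bool)
    for _ in range(n_rounds):
        shadows = rng.permuted(X, axis=0)  # shuffling destroys any real signal
        full = np.hstack([X, shadows])
        imp = np.abs([np.corrcoef(full[:, j], y)[0, 1]
                      for j in range(full.shape[1])])
        cutoff = imp[n_feat:].max()        # best shadow importance
        keep &= imp[:n_feat] > cutoff
    return keep

# Toy data: one informative feature and one feature constructed to be
# exactly uncorrelated with the outcome.
rng = np.random.default_rng(1)
x0 = rng.normal(size=300)
y = x0 + 0.1 * rng.normal(size=300)
z = rng.normal(size=300)
yc = y - y.mean()
x1 = z - z.mean()
x1 -= (x1 @ yc) / (yc @ yc) * yc           # zero correlation with y
keep = shadow_feature_select(np.column_stack([x0, x1]), y)
```

Requiring a feature to beat the best shadow in every round mirrors the repeated-seed overlap used in the study to obtain a robust feature set.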

Imbalance classification analysis

In general, there are two strategies for handling class-imbalanced classification: the data-level approach and the algorithm-level approach [Ali A. Shamsuddin S.M. Ralescu A.L. Classification with class imbalance problem: a review.]. The data-level approach employs a preprocessing step to rebalance the class distribution. Sampling, as a preprocessing step, is a very effective method for addressing class imbalance [The class imbalance problem: Significance and strategies.] and has been proven to improve the predictive power of modeling on class-imbalanced datasets [Resampling methods improve the predictive power of modeling in class-imbalanced datasets.]. Our H2O model adopted a sampling method to adjust the skewed stroke distribution and improve training. In addition to sampling methods, feature selection is another preprocessing step gaining popularity in class-imbalanced classification tasks. Feature selection removes irrelevant, redundant, or noisy features that contribute to the class-overlap problem under class imbalance [Ali A. Shamsuddin S.M. Ralescu A.L. Classification with class imbalance problem: a review.,Cuaya G. Munoz-Meléndez A. Morales E.F. A minority class feature selection method.]. We applied feature selection to reduce the features used in our nomogram.

The algorithm-level approach, in which the algorithms are fine-tuned to improve the learning of smaller classes, includes one-class learning and cost-sensitive learning (16). Our IF model is a one-class classification algorithm aimed at outlier or anomaly detection [Chen W.-R. Yun Y.-H. Wen M. Lu H.-M. Zhang Z.-M. Liang Y.-Z. Representative subset selection and outlier detection via isolation forest.]. It can be effective for imbalanced classification datasets where stroke cases are both few in number and distinct in the feature space. Our DNN is a cost-sensitive neural network (25).
It is trained with the focal loss function (29), which assigns a larger error weight to stroke cases and reshapes the standard cross-entropy loss to improve class-imbalance learning during standard DNN training. Our LR adjusts observation weights inversely proportional to stroke frequencies in the training data to improve class-imbalance training [Learning with positive and unlabeled examples using weighted logistic regression.].

Thresholding is another cost-sensitive approach, applied at the data level in a postprocessing step, that aims to identify the optimal decision thresholds for classification [Esposito C. Landrum G.A. Schneider N. Stiefl N. Riniker S. GHOST: adjusting the decision threshold to handle imbalanced data in machine learning.]. For binary classification, 0.5 is typically the threshold, but it may be biased toward the majority class in imbalanced data [Esposito C. Landrum G.A. Schneider N. Stiefl N. Riniker S. GHOST: adjusting the decision threshold to handle imbalanced data in machine learning.,Zhang X. Gweon H. Provost S. Threshold moving approaches for addressing the class imbalance problem and their application to multi-label classification.], so moving the decision threshold is an alternative technique for dealing with class imbalance [Collell G. Prelec D. Patil K. Reviving Threshold-Moving: a Simple Plug-in Bagging Ensemble for Binary and Multiclass Imbalanced Data.]. The Youden index is a linear transformation of the mean of sensitivity and specificity. It can define thresholds that avoid failures in evaluating an algorithm's ability and is applied in imbalanced cases because of its invariance to the imbalance ratio [Starovoitov V.V. Golub Y.I. Comparative study of quality estimation of binary classification.,Pena F.A.G. Fernandez P.D.M. Tarr P.T. Ren T.I. Meyerowitz E.M. Cunha A. J regularization improves imbalanced multiclass segmentation.,Usman M. Khan S. Lee J.A. AFP-LSE: antifreeze proteins prediction using latent space encoding of composition of k-spaced amino acid pairs.]. Therefore, we also used the Youden index to define thresholds for classification.
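The Youden-index threshold selection described above can be sketched in a few lines; the toy scores below are hypothetical and simply illustrate a severely imbalanced case.

```python
import numpy as np

def youden_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Pick the decision threshold maximizing Youden's J = sensitivity +
    specificity - 1, instead of the default 0.5 that can be biased toward
    the majority class on imbalanced data."""
    best_t, best_j = 0.5, -1.0
    pos, neg = (y_true == 1), (y_true == 0)
    for t in np.unique(y_score):
        pred = y_score >= t
        sens = (pred & pos).sum() / pos.sum()
        spec = (~pred & neg).sum() / neg.sum()
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t

# Imbalanced toy scores: 8 controls with low scores, 2 cases with higher ones.
y = np.array([0] * 8 + [1] * 2)
s = np.array([.05, .1, .1, .2, .2, .3, .3, .35, .4, .6])
t = youden_threshold(y, s)
```

Here the selected threshold (0.4) separates the two classes perfectly, whereas the default 0.5 would miss one of the two cases.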

SHapley Additive exPlanations

SHapley Additive exPlanations (SHAP), a tool for model interpretation, is based on a game-theoretic approach and can explain the output of any ML model [Zhang K. Zhang Y. Wang M. A unified approach to interpreting model predictions Scott.]. Thus, it can provide clinical value for interpreting the influential factors of stroke. SHAP was developed to solve the inconsistency problem of many feature attribution methods: a feature may play an important role in the model, yet importance calculations such as “Gain”, “Split”, and “Saabas” may assign it a lower importance value; SHAP guarantees consistency in theory [Lundberg S.M. Erion G.G. Lee S.-I. Consistent Individualized Feature Attribution for Tree Ensembles.].

SHAP assigns each feature an importance value reflecting its effect on a particular classification. For a given feature i, the Shapley value [A unified approach to interpreting model predictions.] is the weighted average of all possible marginal contributions of feature i and is used as its attribution. Eq. (1) presents the classic Shapley value estimation [A unified approach to interpreting model predictions.] for feature i.

$$\mathrm{SHAP}_{i}(x)=\sum_{S\subseteq F\setminus\{i\}}\frac{|S|!\,\left(|F|-|S|-1\right)!}{|F|!}\left[\mathrm{Prediction}_{S\cup\{i\}}(x)-\mathrm{Prediction}_{S}(x)\right]\tag{1}$$

where $F$ is the set of all features and $S$ ranges over the feature subsets that exclude feature $i$.
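To make Eq. (1) concrete, here is a brute-force computation of exact Shapley values for a tiny model. This is illustrative code, not the SHAP library: it enumerates all subsets and fixes features outside a subset to a baseline value, a simple stand-in for marginalizing them out.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values following Eq. (1): each feature's attribution is
    the factorial-weighted average of its marginal contributions over all
    subsets of the other features."""
    n = len(x)

    def f(subset):
        # Evaluate the model with features outside `subset` held at baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += w * (f(set(S) | {i}) - f(set(S)))
        phis.append(phi)
    return phis

# For a linear model, the Shapley value of feature i reduces to
# w_i * (x_i - baseline_i).
model = lambda z: 2 * z[0] + 3 * z[1]
phi = shapley_values(model, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

The attributions also satisfy the efficiency property: they sum to the difference between the prediction at x and at the baseline.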
