Evaluation of machine learning-based classification of clinical impairment and prediction of clinical worsening in multiple sclerosis

Study population

Data were retrospectively collected from the early MS cohort of Berlin, Germany (32 people diagnosed with clinically isolated syndrome (CIS) and 93 people with MS (pwMS)) [25, 26] and from people with clinically definite MS from the Amsterdam MS cohort, the Netherlands (330 pwMS) [27, 28]. All included subjects were over the age of 18 and had a clinical assessment and a structural MRI examination available at baseline. Clinical measurements included the Expanded Disability Status Scale (EDSS), 9-Hole Peg Test (9HPT), Timed 25-Foot Walk (T25FW), and Symbol Digit Modalities Test (SDMT) scores. The early MS cohort of Berlin had a 2-year clinical follow-up available for all 125 included subjects. For Amsterdam, a 5-year clinical follow-up was available for 225/330 included subjects. The institutional ethics review boards of both institutions (Amsterdam UMC, Amsterdam and Charité, Berlin) approved the study protocol, and subjects gave written informed consent prior to participation.

MRI acquisition

All subjects underwent a 3T MRI examination including the following pulse sequences: 3D T1-weighted (3D-T1) and 3D fluid-attenuated inversion recovery (3D-FLAIR). The scanning protocol in Berlin included a 3D-T1 magnetization prepared rapid acquisition gradient echo sequence (1.0 × 1.0 × 1.0 mm resolution, repetition time (TR) = 1900 ms, echo time (TE) = 3.03 ms, inversion time (TI) = 900 ms, flip angle = 9°) and a 3D-FLAIR sequence (1.0 × 1.0 × 1.0 mm resolution, TR = 6000 ms, TE = 388 ms, TI = 2100 ms), using a Tim Trio scanner (Siemens Medical Systems, Erlangen, Germany). The scanning protocol in Amsterdam included a 3D-T1 fast-spoiled gradient-echo sequence (1 × 0.9 × 0.9 mm resolution, TR = 7.8 ms, TE = 3 ms, TI = 450 ms, flip angle = 12°) and a 3D-FLAIR sequence (1.2 × 1.0 × 1.0 mm resolution, TR = 8000 ms, TE = 125 ms, TI = 2350 ms), using a Signa HDxt scanner (GE Healthcare, Milwaukee, WI).

MRI processing

T2-lesion volumes (T2LV) were determined on 3D-FLAIR. In Berlin, lesions were manually segmented by two expert MRI technicians using ITK-SNAP (www.itksnap.org) [26]. In Amsterdam, lesions were automatically segmented using a k-nearest neighbor algorithm and visually checked [28,29,30]. To reduce lesion-associated bias in brain tissue segmentation, lesions were filled with values approximating normal white matter on 3D-T1 [31]. Whole-brain, cortical gray matter (CGM), and deep gray matter (DGM) segmentations were derived for both centers by running the FreeSurfer 7.1.1 (http://surfer.nmr.mgh.harvard.edu/) recon-all pipeline on the lesion-filled 3D-T1. Subsequently, the cortical surface of each subject was parcellated into 210 regions using the Brainnetome Atlas (BNA) [32]. The volumes of the left and right regions were averaged to reduce the number of input features while preserving anatomical information, resulting in five global volumes (whole-brain volume (WBV), CGM volume, DGM volume, lateral ventricular volume (LVV), and cortical cerebellum volume), 105 CGM regional volumes, and seven DGM regional volumes (thalamus, accumbens, putamen, caudate, pallidum, amygdala, hippocampus). All volumes, except for T2LV, were corrected for head size by dividing the volume by the estimated total intracranial volume (eTIV).
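
For illustration, a minimal sketch of this volume post-processing, assuming the FreeSurfer/BNA outputs have been gathered into a pandas DataFrame with hypothetical paired hemisphere columns (e.g., "Thalamus_L"/"Thalamus_R") and an "eTIV" column:

```python
# Minimal sketch (assumed column names, not the study's actual pipeline code).
import pandas as pd

def average_and_normalize(volumes: pd.DataFrame, regions: list) -> pd.DataFrame:
    """Average left/right regional volumes and correct for head size."""
    out = pd.DataFrame(index=volumes.index)
    for region in regions:
        # Average the paired hemispheric volumes into a single feature.
        mean_volume = (volumes[f"{region}_L"] + volumes[f"{region}_R"]) / 2
        # Head-size correction: divide by estimated total intracranial volume.
        out[region] = mean_volume / volumes["eTIV"]
    return out

# Example with the seven DGM structures named above:
# dgm = average_and_normalize(df, ["Thalamus", "Accumbens", "Putamen",
#                                  "Caudate", "Pallidum", "Amygdala",
#                                  "Hippocampus"])
```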

Overview of machine learning approach

Based on different sets of clinical and MRI features, we trained several ML algorithms to classify clinical impairment and to predict clinical worsening. The input features included demographic (age, sex) and clinical information (symptom duration, MS subtype, use of disease-modifying therapy (DMT)) as well as structural MRI volumes. For classifying clinical impairment, the outcome measures were EDSS (cutoff ≥ 4), defining high disability, and SDMT (Z-score ≤ −1.5), defining cognitive impairment. Clinical worsening was evaluated over a follow-up of 2 years (Berlin) or 5 years (Amsterdam), based on EDSS, 9HPT, T25FW, or SDMT scores. The significance of classification and prediction models was assessed with permutation testing, and the most important clinical and MRI features were determined using Shapley additive explanations (SHAP). See Fig. 1 for an overview of the machine learning approach.

Fig. 1

Overview of the machine learning (ML) approach. Input features for ML included different sets of clinical variables and MRI volumes. Different ML approaches were evaluated to classify clinical impairment and predict clinical worsening. Shapley additive explanations were applied to identify the most important clinical and MRI features for classification and prediction. DGM deep gray matter; BNA Brainnetome Atlas; LR logistic regression; SVM support vector machine; RF random forest; XGBoost eXtreme Gradient Boosting; EDSS Expanded Disability Status Scale; SDMT Symbol Digit Modalities Test; 9HPT 9-Hole Peg Test; T25FW Timed 25-Foot Walk Test

Clinical impairment

Clinical impairment at baseline was defined using the EDSS or SDMT. PwMS were classified as having low or high disability based on an EDSS ≥ 4 cutoff. For classifying preserved or impaired cognition, standardized SDMT Z-scores were calculated based on German normative healthy control data for the Berlin cohort [33] and norm scores of matched healthy controls for the Amsterdam cohort [28]. PwMS with an SDMT Z-score of ≤ −1.5 were considered cognitively impaired.
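
As a sketch, the two baseline impairment labels reduce to simple thresholds (function and variable names are illustrative):

```python
# Illustrative label functions for baseline impairment; thresholds follow
# the text (EDSS >= 4, SDMT Z-score <= -1.5).
def high_disability(edss: float) -> bool:
    """EDSS-based label: high disability at baseline."""
    return edss >= 4.0

def cognitively_impaired(sdmt_z: float) -> bool:
    """SDMT-based label: cognitive impairment at baseline."""
    return sdmt_z <= -1.5
```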

Clinical worsening

Clinical worsening was assessed using the EDSS, 9HPT, T25FW, a combination of EDSS, 9HPT, and T25FW (EDSS+), or SDMT. EDSS-based worsening was defined as an increase in EDSS of ≥ 1.5 points for a baseline score of 0, an increase of ≥ 1.0 for a baseline score between 1.0 and 5.5, or an increase of ≥ 0.5 for a baseline EDSS score of ≥ 6.0 [34]. For 9HPT and T25FW, clinically meaningful worsening was defined as a ≥ 20% increase in the time required to finish the test compared to the baseline measurement [35, 36]. For 9HPT, worsening was assessed separately for the dominant and non-dominant hand. Worsening on EDSS+ was defined as worsening on ≥ 1 of the three components (EDSS, T25FW, or 9HPT (dominant or non-dominant hand)) [21]. Lastly, SDMT worsening was defined as a decrease of ≥ 4 points compared to the baseline measure [37].
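
These criteria can be expressed as label functions; the sketch below uses illustrative names and the thresholds cited above:

```python
# Illustrative worsening labels; thresholds follow the cited criteria.
def edss_worsening(baseline: float, follow_up: float) -> bool:
    """Baseline-dependent EDSS worsening thresholds [34]."""
    delta = follow_up - baseline
    if baseline == 0:
        return delta >= 1.5
    if baseline <= 5.5:
        return delta >= 1.0
    return delta >= 0.5  # baseline >= 6.0

def timed_test_worsening(baseline_s: float, follow_up_s: float) -> bool:
    """9HPT/T25FW worsening: >= 20% increase in completion time [35, 36]."""
    return follow_up_s >= 1.2 * baseline_s

def sdmt_worsening(baseline: float, follow_up: float) -> bool:
    """SDMT worsening: decrease of >= 4 points [37]."""
    return baseline - follow_up >= 4

def edss_plus_worsening(edss_w: bool, t25fw_w: bool,
                        hpt_dom_w: bool, hpt_ndom_w: bool) -> bool:
    """EDSS+: worsening on >= 1 of the three components [21]."""
    return edss_w or t25fw_w or hpt_dom_w or hpt_ndom_w
```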

Input features

To compare the performance of clinical and MRI-derived features for classifying clinical impairment and predicting clinical worsening, and to assess which combination would yield the best-performing models, five feature sets were defined: (1) clinical data, (2) global MRI volumes, (3) clinical data + global MRI volumes, (4) regional MRI volumes, and (5) clinical data + regional MRI volumes. See Table 1 for an overview of the features included in each set. Clinical data included age, sex, symptom duration, MS subtype (CIS, RRMS, or progressive MS), and use of DMT (yes or no).
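
A hypothetical encoding of the five feature sets as column groupings is sketched below; all column names are assumptions, and the placement of T2LV alongside the global volumes is assumed rather than read from Table 1:

```python
# Hypothetical column groupings mirroring the five feature sets.
CLINICAL = ["age", "sex", "symptom_duration", "ms_subtype", "dmt_use"]
GLOBAL_MRI = ["t2lv",  # assumed to accompany the five global volumes
              "wbv", "cgm_volume", "dgm_volume", "lvv", "cerebellum_cgm"]
REGIONAL_MRI = [f"bna_cgm_region_{i}" for i in range(1, 106)] + [
    "thalamus", "accumbens", "putamen", "caudate",
    "pallidum", "amygdala", "hippocampus",
]

FEATURE_SETS = {
    "clinical": CLINICAL,
    "global_mri": GLOBAL_MRI,
    "clinical_plus_global": CLINICAL + GLOBAL_MRI,
    "regional_mri": REGIONAL_MRI,
    "clinical_plus_regional": CLINICAL + REGIONAL_MRI,
}
```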

Table 1 Combinations of clinical and MRI features for classifying clinical impairment and worsening

Machine learning model training

Five ML algorithms were compared for the classification of clinical impairment and the prediction of worsening for the different clinical outcomes. Multiple ML models were compared because they have varying abilities to capture linear and non-linear relationships between input features [38]. As linear classifiers, logistic regression (LR) and support vector machine with a linear kernel (SVM-lin) [39] were selected due to their robust performance in prior structural neuroimaging studies [40, 41]. Their performance was compared to three non-linear models that have been successfully applied in other studies [42]: SVM with a radial basis function kernel (SVM-RBF) [43], eXtreme Gradient Boosting (XGBoost) [44], and random forest (RF) [45]. The ML pipelines were implemented in Python 3.9.15 using the scikit-learn [46] and xgboost [47] software packages.

Due to differences in demographics, follow-up times, and MRI scanners and protocols between the cohorts, it was not appropriate to use one cohort as a training set and the other as a validation set. Consequently, ML models were trained independently on the data from each center. Preprocessing of all input features included standardization by removing the mean and scaling to unit variance [48]. In addition, random oversampling of the minority class was applied in the preprocessing pipeline to account for class imbalance during model training. Given the relatively small sample size, stratified Monte Carlo cross-validation with ten repetitions (i.e., 10 randomly selected test sets) was performed using an 80%/20% train/test split to avoid evaluation bias from sampling effects [49, 50]. Within the training set of each repetition, a stratified 5-fold cross-validation was applied for hyperparameter optimization [51]. Because not all clinical outcome variables were available for every participant and splits were drawn at random, the train and test sets differed across clinical outcomes.

Model performance was assessed using several metrics: the area under the receiver operating characteristic curve (AUC), balanced accuracy (BA), precision, and recall (or sensitivity). The AUC assesses the model's ability to distinguish between classes by plotting the true positive rate against the false positive rate across all possible classification thresholds. Balanced accuracy is the average of recall and specificity, providing a balanced view of performance across both classes. Precision quantifies the proportion of true positives among all positive predictions. Recall, also known as sensitivity, measures the proportion of actual positives correctly identified by the model. The final performance of the models was ranked based on the mean and standard deviation of the AUC across the 10 repetitions. The statistical significance of the best-performing ML model for each outcome measure was determined using permutation testing [52]; to reduce computational time, the p value was derived from 100 permutations.
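
This evaluation scheme can be condensed into the following sketch for one of the five models (logistic regression). It uses synthetic stand-in data, an illustrative hyperparameter grid, and the imbalanced-learn package for oversampling; the remaining four models are omitted for brevity:

```python
# Condensed sketch of the evaluation scheme, not the study's exact code.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (balanced_accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit,
                                     permutation_test_score)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one feature set and one binary clinical outcome.
X, y = make_classification(n_samples=200, n_features=10, weights=[0.7],
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                        # zero mean, unit variance
    ("oversample", RandomOverSampler(random_state=0)),  # balance minority class
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}             # illustrative grid

# Stratified Monte Carlo cross-validation: 10 random 80%/20% train/test splits.
outer_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
aucs, bas, precs, recs = [], [], [], []
for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Inner stratified 5-fold CV on the training set for hyperparameter tuning.
    search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    proba = search.predict_proba(X_test)[:, 1]
    pred = search.predict(X_test)
    aucs.append(roc_auc_score(y_test, proba))
    bas.append(balanced_accuracy_score(y_test, pred))
    precs.append(precision_score(y_test, pred))
    recs.append(recall_score(y_test, pred))

print(f"AUC over 10 repetitions: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")

# Permutation test for significance: refit under label permutations and compare
# the observed score against the null distribution (100 permutations).
score, perm_scores, p_value = permutation_test_score(
    pipe, X, y, scoring="roc_auc", cv=outer_cv, n_permutations=100)
```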

Model explanations

To understand which clinical and MRI features were most relevant for predictions of clinical impairment and worsening in pwMS, SHAP values were calculated in a post hoc analysis. SHAP is a local model explanation method that explains the model prediction for each subject by computing the relative importance of every input feature for the final prediction [53]. For an individual, the sum of all SHAP values equals the difference between that subject's prediction and the average predicted probability of clinical impairment or worsening. The global ranking of feature importance was defined as the mean absolute SHAP value of each feature across all subjects and all test sets [54]. SHAP values were calculated with the KernelSHAP method implemented in Python.
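
A minimal KernelSHAP sketch using the shap Python package, assuming a fitted search object and the train/test matrices from the training sketch above; a small background sample keeps KernelSHAP computationally tractable:

```python
# Post hoc SHAP analysis sketch; `search`, `X_train`, and `X_test` are
# assumed to come from a run of the training sketch above.
import numpy as np
import shap

model = search.best_estimator_                          # fitted pipeline
background = shap.sample(X_train, 50, random_state=0)   # background data
explainer = shap.KernelExplainer(
    lambda data: model.predict_proba(data)[:, 1], background)
shap_values = explainer.shap_values(X_test)             # (subjects, features)

# Global feature importance: mean absolute SHAP value across subjects
# (in the study, also pooled across all 10 test sets).
global_importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(global_importance)[::-1]           # most important first
```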
