Early diagnosis of Alzheimer’s disease using machine learning: a multi-diagnostic, generalizable approach

Datasets

Data used in the preparation of this article were obtained from the publicly available Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) and the Open Access Series of Imaging Studies (OASIS) project database. The most recent visit at which a diagnosis was made was considered the best available “ground truth” for training the classifiers. This most recent diagnosis visit must have occurred at least 1 year after the scan selected for classifier training, the maximum follow-up time was 3 years, and diagnosis transitions must have occurred at least 6 months after the MRI scan. For each subject, we selected the earliest available structural MRI scan that fulfilled our study’s requirements (detailed in the next section). Differences between diagnoses in age and sex (which are known to impact brain structure [33, 34]) and in “time from MRI to most recent diagnosis” (which relates positively to the likelihood of diagnostic transitions) were assessed with 1-way ANOVAs (or, if residuals were non-normally distributed, Mann-Whitney tests) and chi-square tests.
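As an illustration, these scan-selection rules reduce to a few filters; the sketch below assumes a pandas table with one row per (subject, scan) pair, and all file and column names are ours, not the study’s:

```python
import pandas as pd

# Hypothetical sketch of the scan-selection rules above; assumes a `visits`
# table with one row per (subject, scan) pair. Column names are ours.
visits = pd.read_csv("visits.csv", parse_dates=["scan_date", "last_dx_date"])

follow_up_days = (visits["last_dx_date"] - visits["scan_date"]).dt.days

# Ground truth: most recent diagnosis, made at least 1 year and at most
# 3 years after the scan used for training.
eligible = visits[(follow_up_days >= 365) & (follow_up_days <= 3 * 365)]

# One scan per subject: the earliest eligible structural MRI.
selected = (eligible.sort_values("scan_date")
                    .groupby("subject_id", as_index=False)
                    .first())
```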

ADNI

The ADNI was launched in 2003 as a public-private partnership, led by principal investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). From ADNI, 570 subjects were included (211 healthy controls (HC), 188 MCI, 171 AD). To ensure diagnostic criteria equivalence across the different ADNI samples (ADNI, ADNI2, ADNIGO, and ADNI3), besides the diagnosis attributed by the clinician based on a clinical interview and exam results, subjects had to fulfill additional criteria based on the ADNI2 procedures manual. Specifically, HC must have a Mini-Mental State Exam (MMSE) score of at least 24 and a Clinical Dementia Rating (CDR) of 0; MCI patients must have an MMSE score of at least 24 and a CDR of 0.5 with a Memory Box score of at least 0.5; and AD patients must have an MMSE score below 27 and a CDR of at least 0.5.
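These inclusion rules amount to a simple predicate; a minimal sketch (thresholds taken verbatim from the text above, while the function and argument names are ours):

```python
def meets_adni2_criteria(dx, mmse, cdr, memory_box=None):
    """Inclusion predicate sketched from the thresholds quoted above
    (function and argument names are ours, not ADNI's)."""
    if dx == "HC":
        return mmse >= 24 and cdr == 0
    if dx == "MCI":
        return (mmse >= 24 and cdr == 0.5
                and memory_box is not None and memory_box >= 0.5)
    if dx == "AD":
        return mmse < 27 and cdr >= 0.5
    return False
```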

OASIS

From the OASIS-3 dataset of the OASIS project, 531 subjects were included (461 HC, 70 AD) [35]. OASIS subjects had to fulfill the same MMSE and CDR requirements used for ADNI subjects and must have been diagnosed by a clinician based on a clinical interview and exam results.

MRI acquisition

ADNI

All ADNI subjects underwent a T1-weighted 3.0 T MRI scan using an MPRAGE (TR = 2300 ms, TE = 2.84–3.25 ms, TI = 900 ms, FA = 8°/9°) or an IR-SPGR (TR = 6.98–7.65 ms, TE = 2.84–3.20 ms, TI = 400 ms, FA = 11°) sequence. Scans using parallel imaging acceleration techniques were excluded because, while some evidence shows that parallel imaging has little impact on brain atrophy rate measurements [15] and morphometric measures [36], differences have been found when using more complex longitudinal measures [37]. As ML methods are more sensitive to peculiarities in the data, acceleration could impact classifier performance. Of the 864 subjects in the ADNI dataset with a 3.0 T MRI scan and a follow-up time of at least 1 year, 294 were excluded for not meeting the acquisition parameter inclusion criteria, leaving a final sample of 570 subjects.

OASIS

All OASIS subjects underwent a T1-weighted 3.0 T MRI scan (TR = 2400 ms, TE = 3.16 ms, TI = 1000 ms, FA = 8°) using an MPRAGE sequence with an in-plane parallel imaging acceleration factor of 2 (these scans were not excluded, as all OASIS acquisitions used parallel imaging).

MRI analysis

Morphometric features

MRI preprocessing was performed using the FreeSurfer v6.0.0 standard pipeline [38]. The cortical structure of each hemisphere was parcellated into 34 regions of interest (ROIs). For each ROI, 8 measures were obtained (surface area, volume, average cortical thickness, standard deviation of the cortical thickness, integrated rectified mean curvature, integrated rectified Gaussian curvature, folding index, and intrinsic curvature index). Additionally, volumes were extracted from 40 subcortical ROIs and 64 white matter ROIs. This set of measures was chosen a priori because it reflects morphometric alterations associated with MCI and AD, specifically brain atrophy [39], cortical thinning [40], cortical gyrification patterns [41], and white matter changes [42]. Finally, each hippocampus was segmented into 13 subfields [43], yielding 26 additional volumes. These hippocampal subfield volumes were extracted because this region has been extensively associated with AD and its progression [44]. Overall, 694 structural features were extracted.
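For illustration, the per-subject feature table could be assembled from tables exported with FreeSurfer’s aparcstats2table/asegstats2table utilities; the file names and column-naming scheme below are our assumptions, not the study’s actual scripts:

```python
import pandas as pd

# Illustrative assembly of the morphometric feature table, assuming the
# FreeSurfer outputs were first exported with `aparcstats2table` /
# `asegstats2table`. File names below are ours.
measures = ["area", "volume", "thickness", "thicknessstd",
            "meancurv", "gauscurv", "foldind", "curvind"]  # 8 per cortical ROI

tables = []
for hemi in ("lh", "rh"):
    for meas in measures:
        t = pd.read_csv(f"{hemi}.aparc.{meas}.tsv", sep="\t", index_col=0)
        t.columns = [f"{hemi}_{c}_{meas}" for c in t.columns]
        tables.append(t)

# Subcortical, white matter, and hippocampal-subfield volumes.
for name in ("aseg.volume.tsv", "wmparc.volume.tsv", "hipposubfields.tsv"):
    tables.append(pd.read_csv(name, sep="\t", index_col=0))

features = pd.concat(tables, axis=1)  # one row per subject
```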

Graph theory features

GT features were derived from the morphometric data. ROI volumes (209 features) were used to compute a binary graph, in which the edge between ROIs $i$ and $j$ was calculated as the volume ratio

$$e_{ij} = \frac{\min(V_i, V_j)}{\max(V_i, V_j)}$$

where $V_i$ and $V_j$ are the volumes of the two ROIs, yielding edge weights between 0 and 1.

Four different thresholds were used for graph binarization (0.3, 0.5, 0.7, and 0.9). From each subject’s graph, 4 node-wise measures were derived: degree (the number of nodes connected to a given node), clustering coefficient (the fraction of a node’s neighbors that are neighbors of each other), node betweenness centrality (the fraction of all shortest paths in the network that contain a given node), and eigenvector centrality (a measure of the influence of a node in a network). These measures were calculated using the Brain Connectivity Toolbox — comprehensive details about these measures can be found in the article accompanying the toolbox [45]. Overall, 836 GT-based features were computed for each subject at each threshold.
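For illustration, the per-subject GT features could be computed as below; note that this sketch uses NetworkX rather than the Brain Connectivity Toolbox actually used in the study, and assumes the min/max volume ratio given above:

```python
import numpy as np
import networkx as nx

def gt_features(volumes, threshold):
    """Node-wise graph-theory measures for one subject (illustrative sketch).

    `volumes`: 1-D array of the 209 ROI volumes. The edge weight between
    ROIs i and j is min(Vi, Vj) / max(Vi, Vj) (the ratio assumed above),
    binarized at `threshold`.
    """
    v = np.asarray(volumes, dtype=float)
    ratio = np.minimum.outer(v, v) / np.maximum.outer(v, v)
    np.fill_diagonal(ratio, 0.0)  # no self-loops
    g = nx.from_numpy_array((ratio > threshold).astype(int))

    degree = dict(g.degree())
    clustering = nx.clustering(g)
    betweenness = nx.betweenness_centrality(g)
    eigenvector = nx.eigenvector_centrality_numpy(g)
    # 4 measures x 209 nodes = 836 features per threshold
    return np.concatenate([[m[n] for n in g.nodes()]
                           for m in (degree, clustering,
                                     betweenness, eigenvector)])
```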

Machine learning classification

The proposed method for multi-diagnostic classification of HC, MCI, and AD is an ensemble of 3 binary classifiers. The approach for each binary classifier and their combination is illustrated in Figs. 1 and 2, respectively.

Fig. 1

Classification approach for binary classifiers. See the “Machine learning classification” section for a detailed explanation. l-SVM, linear support vector machine; DT, decision tree; RF, random forest; ET, extremely randomized tree; LDA, linear discriminant analysis; LR, logistic regression; LR-SGD, logistic regression with stochastic gradient descent learning; MCC, Matthews correlation coefficient; CV, cross-validation; X̄, average; σ, standard deviation. Green color indicates classifiers using only structural features and blue indicates classifiers using only GT features

Fig. 2

Classification approach for multi-class classifiers. See the “Machine learning classification” section for a detailed explanation. CV, cross-validation; SVM, support vector machine. Green color indicates classifiers using only structural features and blue indicates classifiers using only GT features

Binary classifiers for morphometric data

For each binary classifier, data was split into training (70%) and test (30%) sets in a stratified fashion, and 7 different classifiers were trained: (1) a linear support vector machine (l-SVM), (2) a decision tree (DT), (3) a random forest (RF), (4) an extremely randomized tree (ET), (5) a linear discriminant analysis classifier (LDA), (6) a logistic regression classifier (LR), and (7) a logistic regression classifier with stochastic gradient descent learning (LR-SGD). These algorithms were chosen because they expose feature importances and output class probabilities. Feature importance was determined from the weights assigned to the features, except for the tree methods (DT, RF, ET), for which it was determined based on impurity. Overall feature importances for the binary classifier were calculated by averaging the feature importances of each classifier voting in that binary decision.
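One plausible scikit-learn instantiation of the 7 algorithms and of the importance-extraction rule is sketched below; the text does not name the library’s exact classes, so these choices are our assumptions:

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Assumed instantiation of the 7 algorithms; all of these expose feature
# importances and output class probabilities, as described above.
classifiers = {
    "l-SVM": SVC(kernel="linear", probability=True),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "ET": ExtraTreesClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
    "LR-SGD": SGDClassifier(loss="log_loss"),
}

def importances(name, fitted_clf):
    """Impurity-based importances for the tree methods; absolute
    coefficient weights for the linear models."""
    if name in ("DT", "RF", "ET"):
        return fitted_clf.feature_importances_
    return abs(fitted_clf.coef_).ravel()
```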

During training, for feature selection and classifier hyper-parameter tuning, data was split into 5 stratified cross-validation (CV) folds, meaning 20% of the training set was iteratively used for testing and the remaining 80% for training. Within each fold, features were scaled between 0 and 1, a percentile of best features (from 0 to 100% in steps of 10%) was selected using mutual information, ANOVA F-values, or chi-square statistics, and the classifier was then fit. A simple evolutionary algorithm [46] was used to select the best feature percentile, feature selection metric, and classifier hyper-parameters (the hyper-parameters tested for each algorithm are listed in Supplementary Information 1). The evolutionary algorithm evolved over 10 generations, with 10 individuals per generation, a gene mutation probability of 30%, a crossover probability of 50%, and a tournament size of 3. These parameters were selected based on preliminary tests run on a subset of the ADNI MPRAGE training set. We changed two of these hyper-parameters from their default values: we lowered the population size from 50 to 10, because our overall sample size was relatively small and because this improved training speed, and we increased the gene mutation probability from 10% to 30%, as we noticed classifiers were not converging after 10 generations or were getting stuck in local minima. The evolutionary algorithm optimized the MCC, which is more robust than more commonly used metrics such as accuracy or AUC [29, 30]. The 7 binary classifiers were ranked on their performance in the training set folds according to a custom ranking metric that selects for classifiers with good performance and low variability across folds (mean MCC across folds minus the standard deviation of MCC across folds). Classifiers performing above the mean of the 7 classifiers were refitted on the entire training dataset using the tuned hyper-parameters. Finally, the selected classifiers were combined into a single classifier by simple voting.
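The inner-CV pipeline and the custom ranking metric might look as follows; the evolutionary optimizer itself (ref. [46]) is abstracted away, and the dummy data, percentile, score function, and classifier shown are placeholders for one point of the search space:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, matthews_corrcoef

# Stand-in training data; in the study, X is the morphometric feature matrix.
X, y = make_classification(n_samples=200, n_features=694, random_state=0)

# One point of the search space; the percentile, the score function (mutual
# information, ANOVA F, or chi-square), and the classifier and its
# hyper-parameters are what the evolutionary algorithm tunes.
pipe = Pipeline([
    ("scale", MinMaxScaler()),  # features scaled to [0, 1]
    ("select", SelectPercentile(mutual_info_classif, percentile=50)),
    ("clf", RandomForestClassifier()),
])

mcc = make_scorer(matthews_corrcoef)
folds = StratifiedKFold(n_splits=5)
fold_scores = cross_val_score(pipe, X, y, scoring=mcc, cv=folds)

# Custom ranking metric: favor high and stable performance across folds.
rank_score = fold_scores.mean() - fold_scores.std()
```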

Binary classifiers for GT data

The training of classifiers using GT data followed the same approach explained in the previous section, except that the 7 classifiers were trained for each graph-binarization threshold, for a total of 28 classifiers. Since GT features are harder to interpret, we only included GT-based classifiers in the voting if an improvement was expected based on their performance (i.e., if they ranked above the selected morphometric-only classifiers, in which case they replaced the worst of those classifiers).
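Our reading of this replacement rule, as a short sketch over hypothetical lists of (rank_score, classifier) pairs:

```python
# `morpho` holds the already-selected morphometric classifiers and `gt` the
# 28 GT classifiers, each as (rank_score, classifier) pairs (names are ours).
pool = sorted(morpho, key=lambda p: p[0], reverse=True)
for score, clf in sorted(gt, key=lambda p: p[0], reverse=True):
    if score > pool[-1][0]:       # out-ranks the worst selected classifier,
        pool[-1] = (score, clf)   # so it takes that classifier's place
        pool.sort(key=lambda p: p[0], reverse=True)
```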

Multi-diagnostic classifiers

To combine the output of the binary classifiers, we fitted a linear SVM classifier using a one-vs-one multi-diagnostic strategy on the higher-class probabilities output by each binary classifier. We optimized the C parameter using a 5-fold stratified CV approach and refitted the classifier on the entire training set before testing. Overall feature importance for the multi-diagnostic classifier was calculated by weighting the feature importances of each binary classifier by the coefficient of the linear SVM that combines the binary classifiers.
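A sketch of this combination step, assuming already-fitted binary voters (`binary_classifiers`, `X_train`, and `y_train` are hypothetical names) and a C grid of our own choosing, since the actual grid is not stated:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stack each binary classifier's probability for its "higher" class into a
# meta-feature matrix (one column per binary problem: HC-vs-MCI, HC-vs-AD,
# MCI-vs-AD), then combine with a linear one-vs-one SVM.
meta_X = np.column_stack([clf.predict_proba(X_train)[:, 1]
                          for clf in binary_classifiers])

search = GridSearchCV(
    SVC(kernel="linear", decision_function_shape="ovo"),
    param_grid={"C": np.logspace(-3, 3, 7)},  # grid is our assumption
    cv=StratifiedKFold(n_splits=5),
    refit=True,  # refit on the entire training set, as described above
)
search.fit(meta_X, y_train)
```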

Classifier performance evaluation

All reported performances are those of the test set. We opted for a held-out test set instead of a CV scheme because cross-validation cannot be used to estimate performance in all of our experiments: in some, the training data have different characteristics from the test data (e.g., in experiment A3 the training set is ADNI MPRAGE and the testing set is ADNI IR-SPGR, so there is only one possible split into training and testing, with all ADNI MPRAGE in the training set and all ADNI IR-SPGR in the testing set). The test sets were age-matched across diagnosis groups to ensure that any learning bias introduced by age would not be reflected in our test metrics. Furthermore, the test sets were split evenly between sexes, to ensure that sex proportions in the training set would not inflate test results and to allow us to report data stratified by sex with the same confidence interval for both sexes. Finally, the test set had the same proportion of transitions as the training set (i.e., 30% of cases from each type of transition were assigned to the test set). We estimated 95% confidence intervals for each metric using 2000 bootstrapped samples of the test set. Similarly, to compare classifiers’ performance, two-tailed p-values were estimated using 2000 bootstrapped samples from the test set of each of the classifiers being compared, with the statistical significance threshold set at 0.05. Balanced accuracy (BAC) was calculated as the average of the recall obtained on each class. As positive and negative predictive values (PPV and NPV) should be adjusted for disease prevalence for a diagnostic test to be useful in the clinical context [47], we report both using prevalence estimates from the first visit in the clinical setting (HC = 42.0%, MCI = 18.6%, AD = 26.3%, non-AD cognitive impairment = 13.1%) [48] and, additionally, at 50% prevalence as “standardized predictive values” for easier technical comparison with other predictive tests, as recommended [31].
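Concretely, the prevalence adjustment follows the standard Bayes relations (our formulation, with sensitivity and specificity estimated on the test set and the prevalence $p$ taken from [48] or fixed at 50%):

$$\mathrm{PPV}(p) = \frac{\mathrm{Sens}\cdot p}{\mathrm{Sens}\cdot p + (1-\mathrm{Spec})(1-p)}, \qquad \mathrm{NPV}(p) = \frac{\mathrm{Spec}\,(1-p)}{\mathrm{Spec}\,(1-p) + (1-\mathrm{Sens})\,p}$$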

Experiments

Using ADNI: generalizability across acquisition protocols

Using the ADNI dataset, we focused specifically on the impact of the MPRAGE and IR-SPGR protocols on classifier performance. Five diagnostic classification tasks were performed on different subsets of the ADNI dataset (i.e., experiments A1–A5): (A1) MPRAGE only, (A2) IR-SPGR only, (A3) training on MPRAGE and testing on IR-SPGR, (A4) training on IR-SPGR and testing on MPRAGE, and (A5) the full ADNI dataset. Of the 570 ADNI subjects, 423 had MPRAGE scans (149 HC, 145 MCI, 129 AD) and 152 had IR-SPGR scans (62 HC, 48 MCI, 42 AD). For classifiers using both types of scans, only the MPRAGE scan was used for the 5 subjects with both scan types. For classifiers using both protocols in training and testing, MPRAGE and IR-SPGR proportions were balanced across diagnoses, to prevent the classifier from inferring diagnostic information from diagnosis-irrelevant characteristics associated with the acquisition protocol. Experiments A6–A10 are the corresponding classifiers that received both morphometric and morphometric-derived GT metrics as inputs. Finally, a classifier was trained to distinguish between MPRAGE and IR-SPGR scans.

Using ADNI and OASIS: generalizability across datasets

The second set of experiments focused on the combination of the ADNI and OASIS datasets. Since the OASIS scanning protocol parameters were most similar to those of ADNI MPRAGE, we only used ADNI MPRAGE scans in this set of experiments. Five diagnostic classification tasks were performed on different combinations of the 2 datasets using morphometric features (i.e., experiments B1–B5): (B1) ADNI MPRAGE only, (B2) OASIS only, (B3) training on ADNI MPRAGE and testing on OASIS, (B4) training on OASIS and testing on ADNI MPRAGE, and (B5) training and testing on both OASIS and ADNI MPRAGE. As above, for classifiers using both datasets in training, the data were balanced across diagnoses. Experiments B6–B10 correspond to the classifiers that received both morphometric and morphometric-derived GT metrics as inputs. Additionally, a classifier was trained to distinguish between OASIS and ADNI scans. A summary of all diagnostic classification experiments is shown in Table 1.

Table 1 Summary of diagnostic classification experiments

Diagnosis transition

Since we have follow-up data for all patients, we describe the ground truth either as “stable” (the diagnosis did not change between the time of scanning and the most recent clinical follow-up) or as a “transition” (the clinical diagnosis at the time of scanning differs from that at the most recent available clinical follow-up), and we compare performance between “stable” patients and those who “transitioned” within the follow-up time. We did not exclude any transition cases, including diagnosis regressions, to avoid artificially inflating model performance, which could arise from including only reliable or progressive diagnoses.

Clinical applicability potential

We sought to evaluate the potential clinical applicability of our most inclusive diagnosis-wise and protocol-wise biomarker without GT features (“HC vs. MCI vs. AD”; experiment A5) and our most inclusive dataset-wise biomarker without GT features (“HC vs. AD”; experiment B5), as a first step towards establishing their potential clinical usefulness. For this, we used our own biomarker evaluation framework [32], which takes into account two dimensions: “Quality of Evidence,” which refers to the biomarker’s statistical strength and reproducibility, and “Effect Size,” which refers to its predictive power.
