Development of a robust parallel and multi-composite machine learning model for improved diagnosis of Alzheimer's disease: correlation with dementia-associated drug usage and AT(N) protein biomarkers

1 Introduction

Neurodegenerative diseases pose a significant challenge in contemporary medicine, presenting a substantial burden on healthcare systems worldwide and affecting the quality of life for millions globally (Whiteford et al., 2015). These disorders, characterized by the gradual degeneration of the nervous system's structure and function, manifest in a myriad of ways, affecting cognition, motor skills, and overall neurological wellbeing. Neurological disorders affect roughly 15% of the global population at present (Feigin et al., 2020). Over the last three decades, the number of affected individuals has substantially increased.

A comprehensive analysis of various neurological conditions, such as Alzheimer's Disease (AD), Huntington's Disease, Parkinson's Disease, and amyotrophic lateral sclerosis, reveals dementia as a significant outcome of neurological deterioration, with AD being the most prominent cause (Ritchie and Ritchie, 2012; Ciurea et al., 2023; Khan et al., 2023). Neurodegenerative disorders are multifaceted and therefore difficult to diagnose, as they arise from a combination of genetic, environmental, and age-related factors.

Cognitive decline, a characteristic of AD and related dementias, encompasses a spectrum of cognitive impairments ranging from subtle changes in memory and thinking abilities to severe cognitive dysfunction affecting daily functioning (Whiteford et al., 2015; Feigin et al., 2020). Different etiologies contribute to cognitive decline, including neurodegenerative processes such as AD, vascular pathology, Lewy body disease, and other less common causes. Syndromic diagnosis, which includes subjective cognitive decline, mild cognitive impairment (MCI), and the dementia stages of AD, plays a crucial role in characterizing the progression of cognitive decline (Ritchie and Ritchie, 2012; Ciurea et al., 2023).

Dementia currently carries a staggering societal cost, accounting for 1.01% of global GDP (Mattap et al., 2022). This issue is expected to worsen in the coming years, with an estimated 85% increase in global societal costs by 2030, assuming no changes in potential underlying causes (e.g., macroeconomic aspects, dementia incidence and prevalence, and treatment availability and efficacy). As stated in the World Alzheimer Report 2023, the World Health Organization (WHO) warns of a rising global prevalence of dementia, anticipated to increase from 55 million cases in 2019 to 139 million by 2050. As societies continue to age, the related expenses of dementia are estimated to double, from $1.3 trillion in 2019 to $2.8 trillion by 2030 (Better, 2023).

AD poses a significant challenge to the global health landscape and is characterized by its relentless progression as a neurodegenerative disorder. The global rise in AD cases is directly linked to the aging population, with a growing number of people surviving beyond the age of 65 (the key age group prone to AD) (Jaul and Barron, 2017). The growing prevalence of AD in an aging population underscores the pressing need for a comprehensive understanding of the condition and the development of innovative diagnostic techniques (Saleem et al., 2022; Alqahtani et al., 2023). The defining characteristic of AD is the gradual deterioration of cognitive abilities, which ultimately compromises the quality of life for those affected. This disorder not only has a detrimental effect on memory but also disrupts various cognitive and behavioral functions as well as daily functioning (Khan et al., 2023). The implications of this condition extend beyond those affected, impacting their families and caregivers and imposing a burden on healthcare systems globally.

Traditional diagnostic methods, however valuable, often fail to deliver timely and precise diagnoses. Because of its complicated nature, AD detection requires more refined and advanced techniques. Such an approach should not only identify symptoms but also discern the underlying pathological changes in the brain. The current diagnostic framework for AD encompasses a comprehensive approach that combines clinical assessments, neuropsychological tests, neuroimaging methods, and biomarker analysis (Martí-Juan et al., 2020; Khan and Zubair, 2022a,b). Although these methodologies have provided valuable insights, persistent problems remain, such as the need for early detection and the development of more accurate and reliable diagnostic tools. It is within this context that the role of protein biomarkers becomes crucial. Protein biomarkers, including Amyloid-Beta (Aβ), tau, and ptau [AT(N)], have emerged as potential indicators of the complex molecular and cellular alterations linked with AD (Martí-Juan et al., 2020; Khan et al., 2023). These biomarkers aid in the early detection of AD pathology, facilitating diagnosis at the MCI stage, when interventions may be most effective. The incorporation of these indicators into diagnostic frameworks is consistent with the continued search for new strategies to address the challenges posed by AD. However, diagnosis at preclinical stages, characterized by the absence of clinical symptoms, is not recommended in clinical practice and is primarily reserved for research purposes.

Machine learning (ML), a subfield of artificial intelligence, has emerged as a transformative technology in the healthcare industry, with the potential to change how neurodegenerative diseases are diagnosed and managed (Alowais et al., 2023). ML algorithms, equipped with the capacity to analyze vast datasets, recognize patterns, and derive meaningful insights, offer a paradigm shift in identifying subtle changes in neurological parameters that precede overt symptoms (Bhatia et al., 2022; Javaid et al., 2022). By assimilating information from multiple sources, for instance, neuroimaging, clinical data, and genetic profiling, ML algorithms contribute to the development of predictive models that aid in early detection and personalized treatment strategies (Jiang et al., 2017; Ahmed et al., 2020; Hossain and Assiri, 2020; Khan and Zubair, 2022a,b; Arafah et al., 2023; Assiri and Hossain, 2023).

In the present study, the optimization of predictive models for the diagnosis of AD pathologies was carried out using a set of baseline features, and model performance was improved by incorporating additional variables associated with patient drugs and protein biomarkers. Early AD was diagnosed by considering two key criteria: first, whether a patient was taking specific medications, and second, whether Aβ, tau, and ptau protein levels served as significant predictors among participants. In particular, we examined the relationship between AD diagnosis and the use of various medications (calcium and vitamin D supplements, blood-thinning medications, cholesterol-lowering drugs, and cognitive drugs). We also assessed the significance of three cerebrospinal fluid (CSF) biomarkers, tau, ptau, and Aβ, in the diagnosis of AD. The relative importance of these biomarkers in diagnosing AD is still a topic of discussion in the academic community (Brookmeyer et al., 2007; Gauthier et al., 2021).

The adoption of a hybrid-clinical model, incorporating the simultaneous operation of multiple ML models in parallel, emerges as a viable strategy for enhancing predictive accuracy. Given the heterogeneous nature of the dataset under consideration, employing multiple ML models in parallel allows for a comprehensive classification approach. After each classifier generates its individual classification output, a majority voting mechanism is employed to aggregate the predictions. This collective decision-making process, leveraging the consensus among classifiers, serves to enhance overall predictive accuracy. Notably, the incorporation of parallelization principles within our model framework not only contributes to improved performance but also facilitates efficiency gains by optimizing computational resources.

The proposed model is applied to MRI-based data with five diagnostic groups of individuals (cognitively normal, significant subjective memory concern, early mild cognitive impairment, late mild cognitive impairment, and AD), with a further refinement that includes preclinical characteristics of the disorder. It is noteworthy that, in the current study, we aimed to construct a pipeline design employing ML that incorporates comprehensive methodologies to detect Alzheimer's over wide-ranging input values and variables. The proposed design builds a meta-model based on four distinct sets of criteria: diagnosing from baseline features; baseline and medication features; baseline and protein features; and baseline, medication, and protein features. The meta-model incorporated a data preprocessing strategy based on imputation, transformation, and reduction, followed by feature wrapping using the step-forward technique. Furthermore, twelve efficient ML algorithms served as base classifiers. During the construction of the hybrid model, both stacking and voting techniques were employed. Preceding this, cross-validation with 5 and 10 folds was implemented alongside hyperparameter optimization. Subsequently, performance evaluation and comparison were conducted based on various metrics.

Thus, this research seeks to contribute to the broader effort of improving diagnostic approaches for Alzheimer's. This study aims to develop a robust multi-composite machine learning model that improves diagnostic accuracy by studying the intricate relationship between protein biomarkers, drugs, and AD. The model that we have developed offers a tool for healthcare practitioners to advance Alzheimer's diagnosis while also laying the groundwork for further investigation into AD and other neurodegenerative conditions.

2 Methods

2.1 Proposed design

In this section, we present the proposed meta-model, followed by an outline of the key steps involved in our method. The purpose of creating a pipeline environment is to streamline the entire process and ensure that the procedure is successful, i.e., to facilitate internal verification and produce outcomes that are reproducible externally. A schematic flow of our end-to-end approach is illustrated in Figures 1, 2. In the subsequent sections, a detailed description of the proposed approach and the various steps undertaken in this study are presented.


Figure 1. Hybrid-clinical model architecture.


Figure 2. Workflow design for data cleansing framework.

Every machine learning model possesses unique characteristics that enable it to achieve success with specific types of data. As this task involves various kinds and groups of data, we suggest utilizing multiple machine learning models in parallel to classify the data. Once we obtain the output of each classifier, which is a specific class, we aggregate the majority vote of the classifiers to improve overall accuracy. Our model employs the principle of parallelism to increase system accuracy and expedite processing time.
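To make the aggregation step concrete, the following minimal sketch illustrates, on assumed placeholder predictions, how the majority (modal) class could be computed for each subject once the parallel classifiers have produced their outputs; it illustrates the voting principle rather than the exact implementation used here.

```python
# Minimal illustration of majority-vote aggregation: predictions from several
# classifiers run in parallel are combined by taking the modal class per subject.
# The prediction arrays below are illustrative placeholders.
import numpy as np

# rows = classifiers, columns = subjects; values are predicted classes (0=CN, 1=MCI, 2=AD)
predictions = np.array([
    [0, 1, 2, 1, 0],   # classifier 1
    [0, 1, 1, 1, 0],   # classifier 2
    [2, 1, 2, 0, 0],   # classifier 3
])

# For each subject (column), pick the most frequent predicted class.
majority_vote = np.array([np.bincount(col).argmax() for col in predictions.T])
print(majority_vote)  # -> [0 1 2 1 0]
```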

2.2 Study design, participants and dataset collection

The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a comprehensive repository that was established in 2004 and is headed by Michael W. Weiner as the principal investigator. This repository contains clinical, biochemical, genetic, and imaging biomarker data for detecting AD and MCI at an early stage, monitoring their progression, and tracking their development over time through a series of longitudinal, multicenter studies.

The baseline statistical model that we designed was based on the ADNIMERGE dataset from ADNI, which contained selected factors relevant to individuals' clinical, genetic, neuropsychological, and imaging results. The ADNIMERGE dataset consists of four distinct studies, namely ADNI-1, ADNI-2, ADNI-3, and ADNI-GO, which were collected at varying stages of the research project and represent different time periods. Each dataset includes new patients who were enrolled during the study period, as well as previous patients who were continuously monitored. The ADNIMERGE dataset includes 2,175 individuals, ranging in age from 54 to 92 years, and contains 14,036 input values for 113 features. These values were collected over a period of 8 years (2004–2021), with the initial measurement taken at the patient's first visit, followed by 6-month follow-up visits each year for 8 years. We selected only patients who participated in the ADNI-1 phase of the initiative, which comprised 818 individuals and a total of 5,013 input values across 113 variables. These participants were characterized by demographic information, neuropsychological, genetic, MRI, diffusion tensor imaging (DTI), electroencephalography, and positron emission tomography (PET) biomarkers. This was done in order to maintain uniformity across studies and data handling, as well as to ensure that we could successfully select only a single observation for each subject.
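As an illustration of this selection step, the pandas sketch below restricts ADNIMERGE to ADNI-1 participants and keeps a single baseline record per subject; the column names (COLPROT, VISCODE, RID) follow common ADNIMERGE conventions but are assumptions to be verified against the actual file.

```python
# Hedged sketch: restrict ADNIMERGE to ADNI-1 participants and keep one baseline
# record per subject. Column names (COLPROT, VISCODE, RID) are ADNIMERGE-style
# assumptions, not verified here.
import pandas as pd

adnimerge = pd.read_csv("ADNIMERGE.csv", low_memory=False)

adni1 = adnimerge[adnimerge["COLPROT"] == "ADNI1"]               # ADNI-1 phase only
baseline = adni1[adni1["VISCODE"] == "bl"]                        # baseline visit only
baseline = baseline.drop_duplicates(subset="RID", keep="first")   # one row per subject

print(baseline.shape)
```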

This study included and integrated three distinct datasets, i.e., baseline data, drug data, and protein biomarker data. Separate data files containing the drug and protein biomarker data were stored on the ADNI repository; these were examined to extract the pertinent records, which were then added to the baseline data.

2.2.1 Baseline dataset

The baseline dataset comprised 818 participants from the ADNI-1 dataset. Given our objective of using predictors that have been consistently evaluated and effectively integrated into clinical settings, while also being non-intrusive to patients, we chose to exclusively utilize variables from the ADNI-1 baseline dataset. These variables pertain to diagnostic subtypes, demographic details, and scores from clinical and neuropsychological tests. Only diagnoses that were confirmed from screening up until the baseline visit were considered, whereas any patient data with more than 20% missing information were eliminated.

Based on the diagnosis at the follow-up visit, the patients who were diagnosed at baseline were classified into 5 distinct categories: Cognitively Normal, Early Mild Cognitive Impairment, Late Mild Cognitive Impairment, Significant Memory Complaints, and Alzheimer's Disease, abbreviated as CN, EMCI, LMCI, SMC, and AD, respectively. The diagnostic classes EMCI and LMCI were unified into a single diagnostic class, MCI. An additional advantage of doing so was that it made it easier to extend our model in the future for analyses that might contain additional ADNI data. Because SMC patients met the criteria for being cognitively normal, the CN and SMC classes were combined into a single category. The diagnosis feature, which served as the response variable, was thus divided into 3 categories: AD, MCI, and CN. Among the 818 patients involved in the study, 193 individuals were identified with AD, 396 with MCI, and 229 with CN. The demographic information of the study participants, categorized by their baseline diagnosis, is presented in Table 1. Age, years of education, gender, marital status, ADAS11, ADAS13, ADASQ4, CDRSB, DIGITSCOR, FAQ, LDETOTAL, MMSE, mPACCdigit, mPACCtrailsB, TRABSCOR, RAVLT-I, RAVLT-L, RAVLT-F, and RAVLT-PF are the final variables considered in the present study. The statistics and descriptions of these variables are provided in Table 2.


Table 1. Subject demographics.


Table 2. Variable description and related statistics.

2.2.2 Drug dataset

We created five new variables from components inside the drug dataset that fall into one of our five analytic categories: blood thinners, calcium doses, cholesterol-lowering medicines, cognitive drugs, and vitamin D medicines. Table 3 contains a comprehensive list of individual drug and supplement names. It is critical to note that the use of any of these drugs did not preclude a patient from participating in the ADNI cohort (see footnote).


Table 3. Drug list.

2.2.3 Protein biomarker dataset

Using the ADNI dataset, we generated three new variables from components within the CSF biomarker data to determine the amounts of Aβ, tau, and ptau.

Therefore, altogether three unique datasets (baseline data, drug data, and protein data) were gathered, extracted, and handled, and then passed to the next step, i.e., the data cleansing framework, for data pre-processing (Figure 2).

2.3 Data cleansing framework

A flowchart depicting the complete data cleansing process is portrayed in Figure 2. Initially, the controller was configured with specific values: C = 1 for baseline data, C = 2 for baseline and drug data, C = 3 for baseline and protein data, and C = 4 for baseline, drug, and protein data. Table 4 shows a complete description of the procedure for implementing the four controller settings. After the execution of the data cleansing process, 24 individual clean files were generated.


Table 4. Execution steps for data cleansing framework.
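A minimal sketch of the controller logic is given below, assuming the three source tables can be joined on a common subject identifier; the data-frame names and merge key are illustrative, not the exact implementation.

```python
# Illustrative controller logic: decide which feature blocks are merged before
# cleansing. Data-frame names and the merge key (RID) are assumptions.
import pandas as pd

def assemble(controller, baseline_df, drug_df, protein_df, key="RID"):
    """Return the merged dataset for controller C = 1..4."""
    data = baseline_df.copy()
    if controller in (2, 4):        # C=2 and C=4 add the drug variables
        data = data.merge(drug_df, on=key, how="left")
    if controller in (3, 4):        # C=3 and C=4 add the protein biomarker variables
        data = data.merge(protein_df, on=key, how="left")
    return data

# Example: build all four dataset variants, then pass each through the cleansing steps.
# variants = {c: assemble(c, baseline_df, drug_df, protein_df) for c in (1, 2, 3, 4)}
```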

3-step data pre-processing strategy:

The dataset acquired was processed with a three-step ITR approach i.e., Imputation (I), Transformation (T), and Reduction (R).

2.3.1 Imputation

In this study, the dataset suffered from noise, incompleteness, and inconsistency, which typically hinder the mining process. In general, inaccurate or dirty data often pose challenges for mining techniques and can obstruct the extraction of valuable insights. The proportion of missing values for the extracted variables (at baseline) is depicted in Table 2. There exist various techniques to overcome the challenge of missing data; in this study, three different approaches were employed. The simplest method involved removing all instances with missing data, which was implemented as the first strategy, referred to as "without imputation."

There are various methods for handling missing data, including weighting, case-based, and imputation-based techniques (Tartaglia et al., 2011). The latter was used in the present study. Imputation involves predicting missing data values and filling them with suitable approximations, such as the mean, median, or mode. Standard complete-data techniques are then applied to the filled-in data to decrease the biases due to missing values and improve the efficiency of the model. The term "mean imputation" refers to replacing missing values with suitable approximations, such as the mean, and subsequently applying standard complete-data procedures to the filled-in data (Khan and Zubair, 2019). This was the second approach employed in the study. The "model imputation" method, on the other hand, involves replacing missing values with approximations predicted by a model, such as a linear regression model, and then applying a standard complete-data procedure to the filled-in data (Khan and Zubair, 2019). This was the third approach used in the study.
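As a hedged illustration of the second and third strategies, the sketch below applies mean imputation and a regression-based (model) imputation to a toy matrix using scikit-learn; the data and estimator choice are placeholders.

```python
# Illustration of mean imputation and model (regression-based) imputation.
# The toy matrix stands in for the ADNI variables with missing entries.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 3.0]])

# Mean imputation: replace each missing entry with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Model imputation: predict each missing entry from the other columns with a
# linear regression model, iterating until the estimates stabilize.
X_model = IterativeImputer(estimator=LinearRegression(), random_state=0).fit_transform(X)

print(X_mean)
print(X_model)
```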

2.3.2 Transformation

In this study, data transformation involved two main techniques: normalization and smoothing. These techniques aim to improve the quality and interpretability of the data (Pires et al., 2020; Maharana et al., 2022). In this study, normalization was applied to the ADNI dataset to standardize the scale of numerical values, which varied in range for different variables. The values were adjusted and transformed in a way that they fall within a specified range, often between 0 and 1 (Pires et al., 2020; Maharana et al., 2022). This adjustment ensured that each variable had equal importance in the analysis and prevented any one variable from dominating due to its scale. Smoothing was then performed to remove any noise or irregularities in the ADNI data, making it easier to identify underlying patterns. This was particularly useful for our research, as the data contained random variations and anomalies that could have masked meaningful trends and patterns.
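The sketch below illustrates these two transformation steps on placeholder data: min-max normalization to the [0, 1] range and a simple rolling-mean smoothing pass; the window size and values are illustrative assumptions, not the exact settings used in this study.

```python
# Illustrative normalization (to [0, 1]) and smoothing (rolling mean) on toy data.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"ADAS13": [10.0, 12.0, 55.0, 18.0, 20.0],
                   "MMSE":   [30.0, 29.0, 21.0, 27.0, 26.0]})

# Normalization: rescale every numeric variable to [0, 1] so that no single
# variable dominates the analysis because of its scale.
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Smoothing: a centered rolling mean damps random fluctuations (window is arbitrary here).
smoothed = normalized.rolling(window=3, center=True, min_periods=1).mean()

print(smoothed)
```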

2.3.3 Reduction

Data reduction methodologies play a significant role in the analysis of reduced datasets while maintaining the integrity of the original data (Khan and Zubair, 2022a). This approach is often used to enhance efficiency, streamline analysis, and effectively manage large datasets (Maharana et al., 2022). There are several techniques for implementing data reduction, i.e., dimension reduction, sampling, aggregation, and binning. Each method is applied based on the type of dataset and variables present. In this study, we employed a dimension-reduction strategy. This method facilitated the identification and elimination of variables and dimensions that were insignificant, poorly correlated, or redundant.
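A minimal sketch of such a dimension-reduction pass is shown below: near-constant variables and one member of each highly correlated pair are dropped; the thresholds are illustrative choices, not the exact criteria used in this study.

```python
# Illustrative dimension reduction: drop near-constant columns and one column from
# every highly correlated pair. Thresholds are placeholder choices.
import numpy as np
import pandas as pd

def reduce_dimensions(df, var_threshold=1e-3, corr_threshold=0.95):
    # 1) remove near-constant (uninformative) variables
    variances = df.var(numeric_only=True)
    df = df[variances[variances > var_threshold].index]
    # 2) remove one variable from each highly correlated (redundant) pair
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=redundant)

example = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8],
                        "c": [1, 1, 1, 1], "d": [4, 1, 3, 2]})
print(reduce_dimensions(example).columns.tolist())  # -> ['a', 'd']
```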

2.4 Load clean data

The subsequent step involved loading the clean data (as depicted in Figure 1). Initially, the clean data files generated when the controller was set to 1 were loaded and the entire pipeline was executed. Similarly, this process was repeated for the remaining three controllers, 2, 3, and 4, resulting in the creation of twelve different ML meta-models. Later, comparisons were performed to determine which model performed optimally across a range of applied methods.

2.5 Machine learning modeling

Machine learning techniques were utilized to develop a classifier capable of identifying potential instances of AD and MCI. A hybrid-clinical classification model was constructed, incorporating variables selected during the feature selection process. This model development process was repeated for each of the 4 controllers separately. The clean data was passed to the base ML classifiers, and each model was trained on the training set. The feature selection process was conducted to determine the optimal set of features; performance was assessed, and the process was run iteratively until the best feature set was identified. Subsequently, we applied 5-fold repeated stratified cross-validation and hyperparameter optimization techniques to obtain an optimized set of algorithms. The optimized classifier, trained on the complete training set, was then applied to the independent test set following 10 iterations of 5-fold repeated stratified cross-validation on the training set.

The entire modeling process is explained in the following sections and is illustrated in Figures 1, 3, 4.


Figure 3. Five-step process: step forward feature selection.


Figure 4. Ten-fold repeated stratified cross-validation.

2.5.1 Base ML models

We used a varied set of ML algorithms and techniques in our operations. The availability of well-established implementations of the algorithms in question was taken into account when selecting these tools for this study. In this study, twelve efficient ML models were used, namely, Multinomial Logistic Regression, K-Nearest Neighbors, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Decision Tree, Random Forest, AdaBoost, Principal Component Analysis with Logistic Regression, Support Vector Machine - Radial Basis Function, Perceptron, MultiLayer Perceptron, and Elastic Nets. Initially, fifteen supervised learning classifiers were executed to study their impact on the built model, and twelve of these fifteen classifiers were selected. Gaussian Process, LinearSVC, and Stochastic Gradient Descent were the three ML algorithms that could not fit the ADNI dataset efficiently and hence yielded reduced performance. The 12 retained models were chosen for their ability to produce high performance during the model development step, and they acted as the base ML models for the creation of a hybrid meta-model.

The decision to use twelve ML classifiers out of fifteen can be attributed to the thorough consideration of algorithm performance and effectiveness in this study. The selection of these twelve classifiers was based on the following factors:

• Diverse Set of Algorithms: The initial set of fifteen classifiers encompassed a diverse range of ML algorithms, allowing for comprehensive coverage of different approaches. By including a variety of models, the study aimed to explore the strengths and weaknesses of various algorithms, ensuring a more robust understanding of the data and problem domain.

• Performance Evaluation: Initially, we explored fifteen supervised learning classifiers. The decision to narrow down to twelve models followed an extensive evaluation in which the performance of each algorithm was thoroughly assessed. Only the top-performing models were retained for further analysis.

• Elimination of Underperforming Models: The decision to exclude certain models was based on their underperformance and lack of contribution to achieving the study's objectives, emphasizing the importance of selecting the most effective algorithms.

The description of each employed classifier is presented below:

A. Multinomial Logistic Regression (MLR): In binary logistic regression, we estimate the probability of a single class, while in MLR, we extend this concept to estimate probabilities for multiple classes. This approach is particularly effective for dependent variables with three or more unordered categories. MLR applies a logistic function to each category to calculate the probability of belonging to that specific group, with k categories represented by k-1 logistic functions (Khan et al., 2024). To interpret the results in terms of relative likelihoods, the probabilities are then normalized to ensure that they sum to 1 across all categories (Aguilera et al., 2006; Yang, 2019). MLR calculates a set of coefficients for each category, representing the correlation between predictor variables and the log-odds of belonging to that group (Yang, 2019; Reddy et al., 2024). Each category has its own intercept, which represents the log-odds when all predictor variables are zero. MLR is usually trained using maximum likelihood estimation, where the model parameters are calculated to maximize the likelihood of observing the specific set of outcomes (Hedeker, 2003).

B. K-Nearest Neighbors (KNN): The KNN algorithm is a supervised ML technique that is primarily used for classification. The main concept of KNN is to generate predictions by examining the majority class amongst the “K” nearest data points. In the context of classification, when presented with a new data point, KNN examines the “K” data points in the training set that are in closest proximity to it (Li, 2024). The class with the highest prevalence among these nearby data points is then allocated to the new data point. Typically, Euclidean distance (a distance metric) is employed by KNN for measuring the similarity or proximity of data points within a multi-dimensional feature space (Zhang et al., 2022; Li, 2024). The parameter “K” indicates the total number of neighbors to take into account. The choice of an appropriate “K” is essential as it can significantly impact the quality of predictions. A small value of K can result in noisy predictions, whereas a larger value of K can help to identify underlying patterns more accurately (Li, 2024). The KNN classifier identifies a set of neighbors and then counts the number of instances of each class amongst these neighbors. The class with the most occurrences is then assigned to the new data point. This is an instance-based learning algorithm that memorizes the training instances rather than explicitly learning a model. KNN is a non-parametric method which means it makes no assumptions regarding the underlying distribution of the data (Wang et al., 2022).

C. Linear Discriminant Analysis (LDA): It is a supervised algorithm designed to discover the optimal linear combinations of features that efficiently separate different classes within a dataset (Tharwat et al., 2017). By maximizing the separation between these classes, LDA effectively results in the formation of different groups of data points (Seng and Ang, 2017). The algorithm intrinsically performs dimensionality reduction, projecting the data onto a lower-dimensional space while preserving the most informative features for classification (Seng and Ang, 2017; Graf et al., 2024). LDA makes the assumption that the covariance matrix is consistent across all classes and that the features within each class follow a normal distribution (Seng and Ang, 2017; Tharwat et al., 2017). To effectively distinguish the distribution of data points within and between classes, LDA computes mean vectors and scatter matrices. Mean vectors represent the centroids of data points in each class, whereas scatter matrices capture the spread or variability within each class. LDA then performs eigenvalue decomposition on the generalized eigenvalue problem involving the within-class and between-class scatter matrices (Tharwat et al., 2017). The directions of maximum discrimination are revealed by the eigenvectors that correspond to the largest eigenvalues. The data points are then projected onto these discriminant directions, resulting in a new space where the classes are clearly separated (Tharwat et al., 2017).

D. Quadratic Discriminant Analysis (QDA): It is a supervised classification algorithm that aims to determine the optimal boundaries for separating different classes in the feature space (Jiang et al., 2018). In contrast to LDA, QDA provides greater flexibility in capturing variability within each class by allowing distinct covariance matrices for each class (Witten et al., 2005; Tharwat et al., 2017). QDA, analogous to LDA, aims to maximize the degree of segregation among distinct classes. This is achieved through the identification of quadratic decision boundaries, which adequately represent the complex relationships among the features (Witten et al., 2005). Whereas LDA presumes that all classes share the same covariance matrix, QDA allows each class to possess a unique covariance matrix. Thus, QDA is more adaptable when handling classes that may exhibit diverse variability patterns (Siqueira et al., 2017). Similar to LDA, QDA calculates the mean vector for each class, which serves as the centroid of the data points in that class. In addition, QDA determines the scatter matrices for each class, which reflect the dispersion or variability present within each class (Witten et al., 2005). Following that, it sets up quadratic decision boundaries that effectively separate classes using the class means and scatter matrices. The function of the decision boundaries is to classify newly acquired data points. Compared to LDA, QDA is better at capturing non-linear correlations and complex patterns within each class because it uses different covariance matrices (Siqueira et al., 2017).

E. Decision Tree (DT): A Decision Tree classifier is a tree-like model that follows a hierarchical structure, where input data points are classified based on a series of decisions made at each node (Khan and Zubair, 2022a,b). Leaf nodes represent the predicted class, whereas internal nodes reflect decisions based on particular features. The tree structure is composed of nodes that learn the most important features for classification. In DT, the features that provide the best separation of classes are positioned closer to the root of the tree (Blockeel et al., 2023; Costa and Pedreira, 2023). The root node is the top node in the hierarchy. It represents the complete dataset and is divided according to the feature that best separates classes. The process of selecting features and splitting the dataset recursively continues until a stopping criterion is met (Blockeel et al., 2023; Costa and Pedreira, 2023). When a new data point traverses the tree, it follows the decision path based on the conditions of the features until it gets to a leaf node. The predicted class for the input data point is then determined by the class associated with that leaf node.

F. Random Forest (RF): Random Forest is an ensemble learning ML model that aggregates predictions from multiple ML models to construct a more robust and precise model (Belle and Papantonis, 2021). It employs decision trees as its base model and uses a bagging technique, which trains several trees on various subsets of the training data (Campagner et al., 2023). For classification, the final prediction of the RF algorithm is determined through a voting mechanism. Each tree votes for a class and the class with the most votes is assigned to the input data point. RF produces multiple subsets of training data using random sampling with replacement (bootstrap sampling). Each subset trains a distinct DT. For each DT, a feature subset is randomly selected at each split. This approach guarantees that the generated trees are diverse, thereby minimizing the likelihood of overfitting to specific features (Wang et al., 2021; Campagner et al., 2023). Each DT is trained independently, allowing the ensemble to capture various aspects of the data and lessen overfitting risk (Campagner et al., 2023). The final prediction is determined through a voting method in which the individual tree predictions are aggregated. The class with the maximum votes is the final predicted class.

G. AdaBoost (AB): AdaBoost, an acronym for Adaptive Boosting, is an ensemble learning algorithm in which a robust classifier is constructed by combining several weak learners. The primary aim of the AB algorithm is to train weak classifiers iteratively on different subsets of the data and allocate high weights to instances that have been incorrectly classified during each iteration (Ying et al., 2013). The final model integrates the predictions of all weak learners with varying weights, giving preference to those that perform well on the training data. The AB classifier initializes each data point in the training set with equal weights (Ding et al., 2022). In subsequent iterations, the weights are modified to focus on instances that are difficult to classify correctly. Weak learners are trained sequentially, and at each iteration, a new weak classifier is fitted to the data (Ying et al., 2013; Ding et al., 2022). Instances that are misclassified are assigned greater weights, resulting in the final model being a weighted sum of all weak classifiers. The weights are calculated based on each classifier's accuracy on the training data, with models that demonstrate higher performance contributing proportionally more to the final prediction (Haixiang et al., 2016).

H. Principal Component Analysis with Logistic Regression (PC-LR): PC-LR is a method that combines the Principal Component Analysis (PCA) method (meant for dimensionality reduction) with the logistic regression (LR) algorithm (meant for classification tasks). PCA creates a new set of uncorrelated features, known as principal components, which capture the maximum variance in the data, transforming the original features (Khan and Zubair, 2020). By employing this method, the dimension of the feature space is decreased. Logistic Regression is a widely-used classifier that effectively handles linearly separable data and calculates the likelihood of an instance belonging to a specific class (Yang, 2019). By combining the dimensionality reduction capabilities of PCA with the classification power of Logistic Regression, the PC-LR method aims to retain the most informative components while reducing overall dimensionality (Aguilera et al., 2006; Yang, 2019). The input data is first transformed into a lower-dimensional space using PCA, and then logistic regression is applied to make predictions based on the reduced feature set. The LR model is trained to determine the probability that a given instance belongs to a particular class. When generating predictions for new data, the class with the highest probability is assigned as the final prediction.

I. Support Vector Machine - Radial Basis Function (SVM-RBF): SVM is a supervised learning technique that seeks to identify a hyperplane in N-dimensional space (N being the number of features) that best distinguishes between data points from various classes (Siddiqui et al., 2023). The objective is to optimize the margin, defined as the distance between the hyperplane and the closest data points from each class. RBF is a widely utilized kernel function in SVM. By transforming the input space to a higher-dimensional space, the RBF kernel can effectively capture non-linear correlations in the data (Ding et al., 2021). The transformed space enables SVM to identify a non-linear decision boundary within the original feature space. SVM-RBF searches for the hyperplane in the transformed space that effectively segregates the data points into their respective classes (Ding et al., 2021). The hyperplane is chosen to maximize the margin, thereby establishing a robust decision boundary. SVM allows for the existence of some misclassified data points to handle cases where a perfect separation is not possible (Siddiqui et al., 2023). The key data points that influence the decision boundary are called support vectors. The RBF kernel has a parameter called gamma (γ), which determines the shape of the decision boundary (Valero-Carreras et al., 2021). Tuning the gamma parameter is crucial, as a small gamma may lead to underfitting, while a large gamma may lead to overfitting (Sacchet et al., 2015). After establishing the decision boundary, SVM-RBF is capable of classifying newly acquired data points by determining which side of the hyperplane they lie on.

J. Perceptron (PC): The Perceptron classifier, originally developed for binary classification, can be adapted to perform multi-class classification using a strategy called One-vs-All (OvA), also known as One-vs-Rest (Raju et al., 2021). The OvA strategy involves training multiple Perceptrons, each dedicated to distinguishing one specific class from all the others (Kleyko et al., 2023). For a multi-class problem with K classes, K Perceptrons are trained; each Perceptron focuses on recognizing one class, treating the instances of that class as the positive class and all other instances as the negative class (Chaudhuri and Bhattacharya, 2000; Raju et al., 2021). Throughout the training phase, the weights of each Perceptron are adjusted based on the instances belonging to its allocated class, with the aim of finding weights that reduce the classification error for that specific class. During the prediction phase, each Perceptron independently predicts for a given input instance, and the class associated with the Perceptron that outputs the highest confidence (largest net input) is assigned as the predicted class for that instance (Chaudhuri and Bhattacharya, 2000; Kleyko et al., 2023). In essence, the decision for multi-class classification is made by employing a one-vs-rest strategy, where each Perceptron is treated as a binary classifier for its assigned class vs. all other classes.

K. MultiLayer Perceptron (ML-PC): This classifier is a type of artificial neural network that is devised to handle challenging ML tasks, such as multi-class classification (Chaudhuri and Bhattacharya, 2000). An ML-PC is comprised of multiple interconnected layers: an input layer, one or more hidden layers, and an output layer (Liu et al., 2023). Data flows through the network in a feedforward manner, with each node in a layer processing information from the previous layer and passing it to the next layer. Each node's weighted sum of inputs is subjected to an activation function such as the hyperbolic tangent (tanh), sigmoid, or rectified linear unit (ReLU). These functions add non-linearity, allowing the network to learn intricate relationships. The input layer represents the features of the input data, with each node corresponding to a feature and the values being the feature values (Chaudhuri and Bhattacharya, 2000). Hidden layers process information from the input layer, capturing complex, non-linear patterns in the data. The output layer generates the final predictions for each class in a multi-class setting, with each node's output representing the model's confidence in predicting that class. Backpropagation, a supervised learning technique, operates to reduce errors by iteratively modifying the weights. The difference between the predicted and actual outputs is measured by the loss function; to minimize this difference, optimization algorithms such as gradient descent are applied to update the weights (Chaudhuri and Bhattacharya, 2000; Liu et al., 2023). The learning rate determines the size of each weight update.

L. Elastic Nets (EN): Elastic Nets, a regularization technique, were originally designed for linear models. In multi-class classification setups, the linear model is extended to handle multiple classes (Mol et al., 2009). Elastic Nets employ a combination of L1 (Lasso) and L2 (Ridge) regularization. L1 regularization encourages sparsity in the model, promoting feature selection, while L2 regularization prevents large coefficients (Zhan et al., 2023). For multi-class classification, Elastic Nets can be applied to extend linear models to predict probabilities for multiple classes. Elastic Nets can be used in conjunction with strategies like One-vs-All (OvA) and One-vs-One (OvO) to handle multi-class challenges (Chen et al., 2018). OvA trains a distinct model for each class against the rest, whereas OvO trains models for each pair of classes. The hyperparameters alpha and l1_ratio, which control the balance between L1 and L2 regularization, must be carefully chosen in Elastic Nets. In general, the output layer utilizes a softmax activation function to transform raw model outputs into class probabilities, ensuring that the predicted probabilities sum to 1. During training, Elastic Nets optimize the model's weights using algorithms like gradient descent to reduce the disparity between predicted and actual class probabilities (Aqeel et al., 2023). The cross-entropy loss function is frequently employed in multi-class classification tasks.
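For concreteness, a hedged sketch of how these twelve base classifiers might be instantiated with scikit-learn is given below; the hyperparameter values are defaults or placeholders rather than the tuned settings reported later, and the elastic-net model is expressed as a penalized logistic regression.

```python
# Illustrative instantiation of the twelve base classifiers with scikit-learn.
# Hyperparameters are defaults/placeholders, not the tuned values used in this study.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

base_classifiers = {
    "MLR":     LogisticRegression(max_iter=1000),
    "KNN":     KNeighborsClassifier(n_neighbors=5),
    "LDA":     LinearDiscriminantAnalysis(),
    "QDA":     QuadraticDiscriminantAnalysis(),
    "DT":      DecisionTreeClassifier(random_state=0),
    "RF":      RandomForestClassifier(n_estimators=200, random_state=0),
    "AB":      AdaBoostClassifier(random_state=0),
    "PC-LR":   make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000)),
    "SVM-RBF": SVC(kernel="rbf", gamma="scale", probability=True, random_state=0),
    "PC":      Perceptron(random_state=0),
    "ML-PC":   MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
    "EN":      LogisticRegression(penalty="elasticnet", solver="saga",
                                  l1_ratio=0.5, max_iter=5000),
}
```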

2.5.2 Feature selection

Feature selection is a significant step while preparing data for ML modeling. A subset of the most pertinent features is chosen from the original set. The primary aim is to enhance model performance, simplify the model, and mitigate the risk of overfitting (Pudjihartono et al., 2022). A model with fewer features is often more interpretable, making it easier to understand the relationships between variables (Barnes et al., 2023).

There are 3 types of feature selection strategies i.e., filter, wrapper, and embedded. In this study, we employed the wrapper method (Dokeroglu et al., 2022; Pudjihartono et al., 2022; Kanyongo and Ezugwu, 2023). Wrapper methods involve evaluating the performance of a ML model based on different subsets of features. Unlike filter methods that assess feature relevance independently of the model, wrapper methods use the actual performance of the model as a criterion for selecting features. This involves training the model multiple times, which can be computationally expensive but may yield more accurate results (Kanyongo and Ezugwu, 2023). Furthermore, there are 3 types of wrapper methods viz. forward selection, recursive feature elimination, and backward elimination. In this study, we employed the step forward feature selection method. It is a specific wrapper method that builds the feature set incrementally. It starts with an empty set of features and adds them one at a time, based on how they affect the model's performance. The process involves five steps, illustrated in Figure 3.
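A minimal sketch of step-forward selection is shown below using scikit-learn's SequentialFeatureSelector; the estimator, number of features, and scoring metric are illustrative assumptions rather than the configuration used in this study.

```python
# Illustrative step-forward (sequential forward) feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=8,     # stop once eight features have been added
    direction="forward",        # start from an empty set and add one feature at a time
    scoring="accuracy",
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```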

2.5.3 Cross-validation and hyperparameter optimization

The objective of this study was to construct a model that exhibits optimal generalized performance rather than performing well only on the cases used during training. Consequently, cross-validation gives an estimate of the overall performance for each hyperparameter configuration (Khan and Zubair, 2022a). To achieve this, the training data was divided into 5 folds. Instances from each fold were held out from the training process, while the model was trained iteratively on the remaining cases. The trained algorithm was then applied to the held-out samples.

In this study, ten iterations of a 5-fold repeated stratified CV training and testing approach were implemented to maintain the distribution of classes across each fold (Figure 4). The Scikit-Learn library in Python provides a RepeatedStratifiedKFold class, which was employed to implement this type of cross-validation. This scheme was used for evaluating and fine-tuning the models, since we were dealing with scenarios where data variability and class imbalance challenges needed to be carefully managed.
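A brief sketch of this cross-validation scheme is given below (5 folds repeated 10 times, preserving class proportions in every fold); the estimator and synthetic data are placeholders.

```python
# Illustration of 5-fold stratified cross-validation repeated 10 times.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.25, 0.5, 0.25], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring="accuracy", cv=cv, n_jobs=-1)

print(f"mean accuracy over {len(scores)} fits: {scores.mean():.3f} (+/- {scores.std():.3f})")
```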

An imbalanced classification problem can pose a significant challenge when building a ML model, especially when the distribution of the target variable is skewed (Kanyongo and Ezugwu, 2023). In such cases, if not addressed properly, the model may perform poorly, resulting in low accuracy. In the present study, we aimed to resolve the imbalanced classification problem by employing the Synthetic Minority Oversampling Technique (SMOTE). SMOTE is a data augmentation technique designed to handle minority classes (Kohavi, 1995; Chawla et al., 2022). As previously stated, the target variable comprised 3 classes: cognitively normal, MCI, and AD subjects. However, we discovered a significant disparity between the class sizes, particularly for the MCI and AD classes, and as a result the initially built ML model showed poor performance and low accuracy. To address this issue, we utilized SMOTE to oversample the minority class, which balanced the class distribution without adding any further information to the ML model.
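The sketch below shows, on synthetic imbalanced data, how SMOTE from the imbalanced-learn package oversamples the minority classes; in practice the oversampling would be applied to the training folds only, and the class proportions shown are placeholders.

```python
# Illustration of SMOTE oversampling with imbalanced-learn on synthetic data.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=800, n_classes=3, n_informative=6,
                           weights=[0.5, 0.3, 0.2], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```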

The hyperparameters were fine-tuned before assessing each classifier. Hyperparameters are essential for structuring ML models and are not learned from the data during training. Hyperparameter optimization is the process of systematically searching for the combination of hyperparameter values that achieves optimal performance from the model (Yang and Shami, 2020; Khan and Zubair, 2022a). It is crucial to optimize these hyperparameters to obtain the best possible results on unseen instances. ML algorithms often have one or more hyperparameters that can be adjusted during the training process, and varying them changes the algorithm's predictive performance. In this study, each model was trained using specific hyperparameter configurations to optimize the hyperparameters for each employed ML algorithm.
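As a hedged illustration of this step, the sketch below tunes one base classifier (SVM-RBF) with a grid search over the repeated stratified CV scheme; the grid values are placeholders, not the tuned settings reported in this study.

```python
# Illustrative hyperparameter optimization for one base classifier via grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1, "scale"]}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=cv, n_jobs=-1)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```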

2.6 Analyze results

The purpose of this step was to assess the performance of the base models and determine whether they exhibited improved performance. If not, the model was subjected to additional training for an extended period of time, using a varied set of optimization hyperparameters.

2.7 Evaluate test performance

This stage assessed performance on the test split of the data. If the performance was found to be poor, the test distribution was reevaluated, and any discrepancies were rectified by creating equitable splits. If the performance proved to be acceptable (with a high level of accuracy), the subsequent step was carried out and a meta-model was constructed.

2.8 Build meta-model

To construct a meta-model, we employed an ensemble learning approach that involved selecting different outputs generated by 12 machine learning classifiers after modeling and optimization. These classifiers were made to run in parallel and subjected to a voting and stacking process. Majority voting and majority stacking are ensemble learning approaches that integrate the predictions of numerous independent models to improve overall predictive performance (Raza, 2019; Dolo and Mnkandla, 2023). Both methods involve aggregating the decisions of multiple models, but they differ in their approaches. Majority voting encompasses multiple models making independent predictions on a given input, with the final prediction being determined by the majority vote or consensus of these individual predictions (Zhao et al., 2023). On the other hand, majority stacking is a more sophisticated method that trains a meta-model, often referred to as a stacker or meta-learner, to combine the predictions of multiple base models (Aboneh et al., 2022; Dolo and Mnkandla, 2023). The process of averaging predictions involves selecting the class with the most votes (the statistical mode) or the class with the highest summed probability. Stacking extends this method by enabling any machine learning model to learn how to integrate predictions from contributing members optimally.
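The sketch below contrasts the two aggregation strategies using scikit-learn's VotingClassifier and StackingClassifier; the base models, meta-learner, and synthetic data are illustrative stand-ins for the twelve optimized classifiers, not the exact configuration used here.

```python
# Illustration of hard majority voting vs. stacking with a meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

base = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    ("mlr", LogisticRegression(max_iter=1000)),
]

# Majority voting: each classifier casts one vote; the modal class is the prediction.
voter = VotingClassifier(estimators=base, voting="hard", n_jobs=-1).fit(X_train, y_train)

# Stacking: a meta-learner is trained on the base models' out-of-fold predictions.
stacker = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression(max_iter=1000),
                             cv=5, n_jobs=-1).fit(X_train, y_train)

print("voting accuracy:  ", voter.score(X_test, y_test))
print("stacking accuracy:", stacker.score(X_test, y_test))
```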

After applying majority voting and majority stacking, we evaluated their performance.
