Neurodegenerative conditions such as Alzheimer disease (AD) and related dementias precipitate accelerated cognitive deterioration, markedly impacting patients’ daily lives and social engagement []. Current estimates suggest that approximately 50 million individuals worldwide live with dementia, a number expected to soar to 152 million by 2050 []. Patients diagnosed with mild cognitive impairment (MCI) are at a much higher risk of developing dementia []. MCI is an intermediate stage between normal cognitive aging and the severe pathological decline of dementia; it affects cognitive function, social abilities, and mental health and may lead to emotional disorders that disrupt daily life []. Epidemiological data reveal that the incidence of MCI is 6.7% among those aged 60-64 years, 8.4% for 65- to 69-year-olds, 10.1% for 70- to 74-year-olds, 14.8% for 75- to 79-year-olds, and 25.2% for 80- to 84-year-olds []. The annual transition rate from MCI to dementia or AD is about 10%-15% [], significantly higher than the 1%-2% annual incidence of dementia in the general population. Despite various potential treatments for AD, including inhibitors of the enzymes that produce amyloid-β and antibodies that clear amyloid-β from the brain [], no current medications can fully cure dementia or significantly alter its clinical course. Moreover, studies indicate that early intervention is effective, necessitating precise and sensitive diagnostic measures for MCI []. Thus, early identification of MCI is crucial, as it enables timely interventions to slow cognitive decline and alleviate the burden of dementia [].
Wearable devices provide a near-continuous, passive data collection method, offering a convenient and minimally invasive approach for the ongoing monitoring and tracking of cognitive decline in patients with MCI. Existing studies have demonstrated that various physiological indicators, such as heart rate variability [], electrodermal activity [], gait variability [], skin temperature [], respiratory rate [], electroencephalography [], eye movement [], and electromyography [], can be effectively used to assess cognitive function changes, providing an objective basis for the auxiliary diagnosis of early cognitive impairment. However, despite the availability of diverse physiological data from patients with MCI, challenges remain in the effective utilization of these data due to the complexity of high-dimensional information (eg, feature redundancy, strong interfeature correlations, and noise interference) and the technical difficulties in multimodal data integration (eg, insufficient feature extraction and dimensionality reduction methods, challenges in aligning heterogeneous modalities, and limitations in handling noise and missing data).
In recent years, machine learning techniques have been increasingly applied to analyzing and processing complex, high-dimensional physiological data to facilitate the early detection of cognitive disorders, including MCI. Traditional algorithms such as naive Bayes [], k-nearest neighbors (KNN) [], support vector machines (SVM) [], and logistic regression (LR) [] have demonstrated a certain degree of effectiveness in identifying high-risk MCI populations. However, due to the limitations of single algorithms in modeling high-dimensional and multimodal data, such as insufficient representational capacity and unstable generalization performance, researchers have gradually shifted toward exploring ensemble methods, including bagging [], boosting [], and stacking []. These ensemble learning techniques integrate predictions from multiple models, effectively mitigating the limitations of single models and significantly improving overall predictive accuracy and robustness.
In addition, various swarm intelligence algorithms have been introduced for critical tasks such as feature selection and hyperparameter optimization to enhance the performance of machine learning models in high-dimensional data analysis. Swarm intelligence algorithms, including Harmony Search (HS) [], Particle Swarm Optimization [], and Genetic Algorithms [], simulate cooperative behaviors observed in nature and have demonstrated outstanding potential in solving global optimization problems. The HS algorithm has gained attention as a metaheuristic optimization technique due to its simplicity, ease of implementation, and few parameter-tuning requirements. Inspired by musical harmony improvisation, the HS algorithm iteratively adjusts the pitch of each instrument (analogous to the high-dimensional features) to find the optimal feature combination. This approach offers an effective solution for feature selection involving physiological data and cognitive parameters of patients with MCI, showing promising prospects in improving model prediction accuracy and computational efficiency.
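To make the analogy concrete, the sketch below shows a minimal binary harmony search for feature selection. The memory size, harmony memory considering rate (hmcr), pitch adjusting rate (par), and iteration budget are illustrative defaults rather than values from any specific study, and `fitness` is a user-supplied function (eg, cross-validated accuracy of a classifier trained on the masked features, minus a penalty on the fraction of features selected).

```python
import numpy as np

rng = np.random.default_rng(42)

def harmony_search(fitness, n_features, memory_size=20, hmcr=0.9, par=0.3, iters=200):
    """Binary harmony search: each harmony is a 0/1 mask over the features."""
    # Initialize the harmony memory with random feature masks and score them
    memory = rng.integers(0, 2, size=(memory_size, n_features))
    scores = np.array([fitness(h) for h in memory])
    for _ in range(iters):
        new = np.empty(n_features, dtype=int)
        for j in range(n_features):
            if rng.random() < hmcr:                  # memory consideration
                new[j] = memory[rng.integers(memory_size), j]
                if rng.random() < par:               # pitch adjustment: flip the bit
                    new[j] = 1 - new[j]
            else:                                    # random improvisation
                new[j] = rng.integers(0, 2)
        score = fitness(new)
        worst = scores.argmin()
        if score > scores[worst]:                    # replace the worst harmony
            memory[worst], scores[worst] = new, score
    best = scores.argmax()
    return memory[best], scores[best]
```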
Objective

We propose a Dynamic Adaptive Ensemble Learning Framework based on an Improved Harmony Search (DAELF-HSI), designed to enhance MCI detection by addressing issues of feature redundancy and the inefficiency of multimodal data fusion. In contrast to previous studies, this research integrates multimodal physiological data collected through wearable wristbands (eg, heart rate variability and electrodermal activity) with cognitive assessment metrics recorded on tablet devices (eg, reaction time and test scores), aiming to exploit the potential value of multisource data comprehensively, thereby improving the accuracy and clinical utility of MCI detection. We hypothesize that the DAELF-HSI framework will not only effectively distinguish between patients with MCI and healthy individuals but also uncover critical discriminative information pertinent to MCI.
The research was reviewed and approved by the Biomedical Ethics Review Committee of Taiyuan University of Technology (20240124). All methods were performed following relevant guidelines and regulations. Written informed consent was obtained from eligible participants under the principles of the Declaration of Helsinki. All participants signed an informed consent form. We provided US $10 to eligible older adults as compensation for participation.
Overview of the Proposed Detection Framework

Figure 1 illustrates a Dynamic Adaptive Ensemble Learning Framework for MCI detection that integrates individual physiological signals with cognitive task data derived from serious games. The framework begins with data collection, followed by time series segmentation, alignment, and preprocessing. It then progresses to feature extraction and selection, culminating in the construction of a classification model. Notably, the modules within the framework are interconnected and sequentially executed, forming a cohesive unit. The following sections detail each stage, demonstrating the adaptability and effectiveness of the proposed MCI detection framework.
Figure 1. A dynamic adaptive ensemble learning framework for mild cognitive impairment detection. CSI: cardiac sympathetic index; CVI: cardiovascular index; EDA: electrodermal activity; HF: high frequency; HSI: harmony search improved; IBI: interbeat interval; LF: low frequency; VLF: very low frequency.

Experimental Participants and Procedures

The dataset used for machine learning modeling involves 843 participants aged 60 years and older recruited from partner hospitals. The participants were randomly divided into a development dataset (674/843) and an independent testing dataset (169/843) in a ratio of 4:1. In addition, 226 older adults were recruited from 3 external centers to constitute an external testing dataset. Participants were identified using a purposive sampling method [], with the process meticulously overseen by experienced neurologists. The inclusion criteria were (1) age ≥60 years; (2) normal hearing and vision, or corrected to normal; (3) completion of the Mini-Mental State Examination (MMSE) test; (4) completion of the Montreal Cognitive Assessment (MoCA) test; (5) capability to engage in moderate activity without physical disabilities; (6) absence of severe depressive symptoms or other neurological disorders such as stroke or Parkinson disease; (7) ability to effectively use smart devices such as smartphones and tablets; and (8) informed consent signed by the participants or their guardians.
Neurologists contacted potential participants during their clinic visits, explaining the study’s purpose, related procedures, and the possible impact of the research findings. Once potential participants expressed interest, neurologists conducted comprehensive medical evaluations, including detailed medical history collection, physical examinations, brain imaging (magnetic resonance imaging or computed tomography scans), and cognitive function assessments (using the MMSE and MoCA scales). The MCI group comprised 514 (48.1%) participants who scored below 26 on the MoCA, while the healthy control (HC) group included 555 (51.9%) healthy individuals without symptoms of cognitive decline. Brain imaging scans revealed no structural abnormalities causing cognitive impairment. Furthermore, all patients with MCI met the criteria proposed by the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association []. We also administered the 13-item habitual hand questionnaire [] to all participants. The MCI and HC groups were matched for age, gender, hand preference, education, average sleep duration (in general), and years of smart device use. All participants completed the cognitive tasks with the assistance of researchers, and there were no dropouts during the testing process. Table 1 summarizes the clinical and demographic characteristics of the 1069 participants enrolled across the development, internal, and external datasets.
Table 1. Characteristics of the 1069 participants enrolled across the development, internal, and external datasets.

aMCI: mild cognitive impairment.
bHC: healthy control.
cP value: 2-tailed t tests (for continuous variables) and chi-square tests (for categorical variables).
dP value: statistical comparisons were performed between the MCI and HC groups within the training dataset.
eP value: statistical comparisons were performed between the MCI and HC groups within the testing dataset.
fP value: statistical comparisons were performed between the MCI and HC groups within the external dataset.
In the experimental setting, a well-trained experimenter instructed participants to sit on a comfortable chair and wear the Empatica E4 on their nondominant wrist. The Empatica E4 is a watch-like multisensor device that measures physiological data such as electrodermal activity (EDA), photoplethysmography, skin temperature, and accelerometer readings. It is compact, lightweight, and comfortable to wear, making it suitable for unobtrusive continuous monitoring during cognitive screening of older adults. The participants performed cognitive tasks on a 2019 iPad using the Brain Nursing mobile app developed by our team [], completing drawing-related tasks with an Apple Pencil. The system includes 11 single tasks and 3 dual tasks, each taking only 1-3 minutes, designed to assess attention; short-term memory; working memory; scene recall and situational reconstruction; visual-conceptual and visual-motor tracking; orientation; executive function; language comprehension and expression; logical thinking; and fine motor control. To minimize interference from the nondominant hand during the drawing tasks, the experimenter provided appropriate assistance, such as stabilizing the tablet. Upon completion of the testing, the Empatica E4 wristband was removed from the participant’s wrist, physiological data were retrieved and downloaded through the E4 Connect portal, and cognitive data were exported from the cloud.
Data Segmentation and Alignment

Overview

For the collected multisource data, such as EDA, interbeat interval (IBI) data describing heart rate variability (HRV), and cognitive data, ensuring integrity, continuity, and temporal alignment is crucial. As participants perform the cognitive tasks, the tablet automatically records timestamps for each test, providing a reference for aligning the physiological data. Thus, we align the EDA and IBI data using the start and end times of each test, as described in the procedures below.
EDA Time Series Processing

The EDA.csv file downloaded from the cloud contains only the session start time and the sampling rate; it lacks timestamps for the individual samples. To address this deficiency, we generate a timestamp every 4 data points (ie, every second) based on the session start time and the sensor sampling rate (4 Hz). Subsequently, we align the timestamps of the cognitive tests with the EDA series timestamps, thereby extracting the EDA signal segments corresponding to specific cognitive tests.
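As a minimal illustration, the sketch below reconstructs per-sample timestamps from the session start time and the 4 Hz sampling rate and then slices out a task’s EDA segment. The 2-row header layout assumed for the Empatica EDA.csv export and the column names are illustrative assumptions, not a documented specification.

```python
import pandas as pd

def load_eda_with_timestamps(path):
    """EDA.csv (assumed layout): row 0 = session start (Unix s), row 1 = rate (Hz)."""
    raw = pd.read_csv(path, header=None)
    start, rate = float(raw.iloc[0, 0]), float(raw.iloc[1, 0])
    signal = raw.iloc[2:, 0].astype(float).reset_index(drop=True)
    ts = start + signal.index / rate          # one timestamp per sample
    return pd.DataFrame({"timestamp": ts, "eda": signal})

def slice_task(eda_df, task_start, task_end):
    """Extract the EDA segment recorded between a task's start and end times."""
    mask = (eda_df["timestamp"] >= task_start) & (eda_df["timestamp"] <= task_end)
    return eda_df.loc[mask]
```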
IBI Time Series Processing

Because the Empatica E4 wristband automatically discards unidentifiable heartbeats during measurement, the IBI.csv file contains discontinuities that do not match the actual measurement intervals. Accurately identifying and filling these measurement gaps is crucial to ensure data integrity when analyzing the IBI time series for the various test tasks. Following the suggestions of Rafi et al [], this study limits the physiologically feasible IBI range to within 2 seconds, and any interval beyond this threshold is automatically labeled as a measurement gap. Subsequently, cubic spline interpolation [] is used to estimate the missing values within these gaps; the temporal continuity of the IBI series is restored by fitting a curve to the available data points. Finally, new timestamps are added to the IBI data, aligning the timestamps of the cognitive tests with those of the IBI series.
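A minimal sketch of this gap-filling step, assuming beat times in seconds and using SciPy’s cubic spline (the 1-second resampling step inside gaps is an illustrative choice):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fill_ibi_gaps(t, ibi, max_gap=2.0, step=1.0):
    """Label spans between successive beats longer than max_gap seconds as
    measurement gaps and estimate the missing values with a cubic spline.
    t: beat times (s, strictly increasing); ibi: interbeat intervals (s)."""
    t, ibi = np.asarray(t, float), np.asarray(ibi, float)
    spline = CubicSpline(t, ibi)                 # fit on the available beats only
    t_new, ibi_new = [t[0]], [ibi[0]]
    for prev, cur, val in zip(t[:-1], t[1:], ibi[1:]):
        if cur - prev > max_gap:                 # gap: insert interpolated samples
            fill_t = np.arange(prev + step, cur, step)
            t_new.extend(fill_t)
            ibi_new.extend(spline(fill_t))
        t_new.append(cur)
        ibi_new.append(val)
    return np.array(t_new), np.array(ibi_new)
```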
Data Preprocessing

Overview

Commercial wearable devices are prone to artifacts, measurement gaps, or deviations from the measurement regime during data recording [,]. To extract reliable information from field-collected data, rigorous preprocessing is required to filter noise and artifacts and restore the original signal. Considering the differences among EDA, IBI, and digital cognitive parameters, we detail the preprocessing methods for each below.
EDA Signal Preprocessing

EDA is a biosignal mirroring individual physiological and emotional states and consists mainly of a slowly varying tonic component and rapidly fluctuating phasic activity. Tonic activity, also known as skin conductance level (SCL), primarily reflects the physiological activity level of an individual at rest, indicating the continuous regulation of the autonomic nervous system. In contrast, phasic activity, or skin conductance response (SCR), is a rapid and transient physiological response to specific stimuli, revealing an individual’s adaptability and reactivity to sudden events. To enhance EDA signal quality, we propose a multistage automatic artifact removal method, comprising artifact correction, signal decomposition, and overlapping sliding time windows, as shown in Figure 2. The specific steps are as follows:
1. Low-pass filtering: the EDA signal is filtered using a first-order Butterworth low-pass filter with a cutoff frequency of 0.6 Hz [], which preserves its low-frequency components and eliminates high-frequency noise.
2. Artifact detection: EDAexplorer [] is used to detect artifacts in the filtered signal, identifying and marking anomalies to provide a basis for data repair.
3. Cubic spline interpolation: cubic spline interpolation is applied to the identified artifact segments, using segmented cubic polynomials to approximate missing data points while ensuring continuity of the function values and their first and second derivatives, thereby smoothly completing the missing data.
4. Signal decomposition: by solving a convex optimization problem, cvxEDA [] separates the signal into tonic and phasic components, enabling enhanced analysis and interpretation of the underlying physiological mechanisms within the EDA signal.
5. Component filtering: the decomposed tonic and phasic components are refiltered with a low-pass Butterworth filter to eliminate negative SCR and SCL values, enhancing signal quality.
6. Time window segmentation: the processed tonic and phasic components are segmented into overlapping time windows of 60 seconds with a step size of 1 second to facilitate subsequent feature extraction (see the sketch after this subsection).

Figure 2. Electrodermal activity signal preprocessing flow.

IBI Signal Preprocessing

Wearable devices commonly use photoplethysmography sensors to monitor continuous variations in interbeat (R-R) intervals. However, obtaining raw photoplethysmography data from the Empatica E4 wristband is challenging, so we instead analyzed the more readily accessible IBI data. Analyzing IBI data allows HRV to be calculated, reflecting the variation in time between consecutive heartbeats. Although the Empatica E4 offers convenient and noninvasive HRV recording, it still suffers from artifacts and measurement gaps [,]. Therefore, we first adopted 4 artifact detection rules to identify artifacts, as shown in Textbox 1. Detected artifacts were then interpolated using cubic spline interpolation to fill in missing values. Finally, the cleaned IBI data were segmented into overlapping time windows of 60 seconds with a 1-second step size, creating datasets for subsequent feature extraction.
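The sketch below condenses the filtering (step 1), cvxEDA decomposition (step 4), and sliding-window segmentation (step 6) of the EDA pipeline using SciPy and NeuroKit2; EDAexplorer-based artifact repair and component refiltering are omitted, and it assumes the cvxEDA backend is available (NeuroKit2 exposes it via `method="cvxeda"`).

```python
import neurokit2 as nk
import numpy as np
from scipy.signal import butter, filtfilt

FS = 4  # Empatica E4 EDA sampling rate (Hz)

def preprocess_eda(eda, fs=FS):
    # Step 1: first-order Butterworth low-pass filter, 0.6 Hz cutoff
    b, a = butter(N=1, Wn=0.6, btype="low", fs=fs)
    filtered = filtfilt(b, a, eda)
    # Step 4: cvxEDA decomposition into tonic (SCL) and phasic (SCR) components
    parts = nk.eda_phasic(filtered, sampling_rate=fs, method="cvxeda")
    return parts["EDA_Tonic"].to_numpy(), parts["EDA_Phasic"].to_numpy()

def sliding_windows(x, fs=FS, win_s=60, step_s=1):
    # Step 6: overlapping 60 s windows advanced 1 s at a time
    win, step = win_s * fs, step_s * fs
    return np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, step)])
```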
Textbox 1. List of interbeat interval artifact detection rules (study and rule description; a code sketch applying these rules follows the Cognitive Data Preprocessing subsection).

Rafi et al []: discard any interbeat interval that does not fall within the physiological range of 250-2000 milliseconds (equivalent to a heart rate of 30-240 beats per minute).

Malik et al []: each interbeat interval should differ by at most 20% from the previous one.

Acar et al []: calculate the average of the 9 interbeat intervals preceding the current one; the current interbeat interval should be removed if it differs from this average by more than 20%.

Karlsson et al []: remove any interbeat interval that differs by more than 20% from the average of its immediately preceding and succeeding interbeat intervals.

Cognitive Data Preprocessing

Outlier removal and data consistency checks were performed manually to preprocess the digital cognitive parameters. The specific steps include (1) format validation: ensuring that all data entries adhere to the required format specifications (eg, time recorded in seconds and scores in numerical format) and correcting any inconsistencies; (2) range validation: checking that all data values fall within predefined acceptable ranges, such as ensuring reaction times lie within a reasonable range of seconds; and (3) continuity validation: assessing the continuity of the data, including verifying that the timestamps for each test are in sequential order and checking for missing data points. Through these steps, we aim to identify and eliminate extreme outliers caused by user errors or external interference while ensuring the logical consistency of data format, range, and time series, thereby improving the overall quality and reliability of the data.
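As promised above, here is a minimal sketch applying the 4 Textbox 1 rules to an IBI sequence; the rules are evaluated jointly in one pass, whereas an actual pipeline might apply them sequentially or iteratively.

```python
import numpy as np

def detect_ibi_artifacts(ibi_ms):
    """Flag interbeat intervals violating the 4 rules in Textbox 1 (values in ms)."""
    ibi = np.asarray(ibi_ms, dtype=float)
    bad = (ibi < 250) | (ibi > 2000)                     # Rafi: physiological range
    for i in range(1, len(ibi)):
        if abs(ibi[i] - ibi[i - 1]) > 0.2 * ibi[i - 1]:  # Malik: >20% from previous
            bad[i] = True
        if i >= 9:
            avg9 = ibi[i - 9:i].mean()                   # Acar: >20% from mean of last 9
            if abs(ibi[i] - avg9) > 0.2 * avg9:
                bad[i] = True
        if i < len(ibi) - 1:
            local = (ibi[i - 1] + ibi[i + 1]) / 2        # Karlsson: >20% from neighbors
            if abs(ibi[i] - local) > 0.2 * local:
                bad[i] = True
    return bad
```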
Multiscale Feature Extraction

In this study, we comprehensively analyzed the data collected from the Empatica E4 wristband and the tablet to explore participants’ physiological and cognitive responses to the various cognitive tasks. Specifically, we used the FLIRT toolkit [] and NeuroKit2 [] to extract 39 EDA-related features (17 SCL features and 22 SCR features) from the EDA signals and 23 heart rate (HR) and HRV features from the IBI data. Furthermore, we collected several cognitive parameters during each test, including time, score, stroke, frequency, and curvature (variance and the ratio of 0 values). Detailed information regarding all these features is available in .
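As a rough sketch of this step, FLIRT can compute windowed EDA and HRV features directly from the Empatica exports. The reader helpers, column names, and keyword arguments below reflect FLIRT’s API as we understand it and should be treated as assumptions to verify against the toolkit’s documentation.

```python
import flirt
import flirt.reader.empatica as empatica

# Load the Empatica exports (assumed FLIRT reader helpers and column names)
eda = empatica.read_eda_file_into_df("EDA.csv")
ibi = empatica.read_ibi_file_into_df("IBI.csv")

# 60 s windows advanced 1 s at a time, matching the preprocessing step size
eda_features = flirt.get_eda_features(eda["eda"], window_length=60, window_step_size=1)
hrv_features = flirt.get_hrv_features(ibi["ibi"], window_length=60, window_step_size=1)
```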
Dynamic Adaptive Feature Selection Based on Improved Harmony Search

Feature selection plays a critical role in handling high-dimensional datasets, as not all features affect the outcome, and an excessive number of features can result in the curse of dimensionality and increased model complexity. Therefore, this study proposes a feature selection algorithm based on HSI to sift through the extracted physiological and cognitive features. Analogous to musical notes, each feature represents a note that may or may not be selected into a subset. Just as musicians repeatedly adjust notes until a satisfactory harmony is reached, the HSI-based feature selection algorithm continuously tunes its parameters and modifies the generated feature subset to maintain diversity and avoid convergence to local optima. In particular, we integrate the Hamming distance into the HSI algorithm to gauge the disparity between a newly generated harmony vector and the optimal vector in the harmony memory, using this similarity measure to fine-tune the search probability: a high Hamming distance moderately reduces the search probability to promote exploration, whereas a low Hamming distance moderately increases it to exploit known information. Finally, to evaluate harmony quality, we minimize the average classification error rate of all base learners together with the feature selection rate as the optimization objectives. The detailed algorithm is provided in .
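A minimal sketch of the Hamming-distance-driven adaptation and the 2-term objective described above; the 0.5 distance threshold, adjustment step, and weighting are illustrative choices, not the study’s tuned values.

```python
import numpy as np

def adapt_hmcr(new_harmony, best_harmony, hmcr, lo=0.7, hi=0.99, step=0.05):
    """Tune the memory-consideration probability from the normalized Hamming
    distance between the new harmony and the best harmony in memory."""
    d = np.mean(np.asarray(new_harmony) != np.asarray(best_harmony))  # in [0, 1]
    hmcr = hmcr - step if d > 0.5 else hmcr + step   # far: explore; close: exploit
    return float(np.clip(hmcr, lo, hi))

def objective(mask, base_learner_errors, alpha=0.9):
    """2-term objective to minimize: mean base-learner error + selection rate."""
    mask = np.asarray(mask)
    return alpha * float(np.mean(base_learner_errors)) + (1 - alpha) * mask.mean()
```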
Dynamic Adaptive Stacking Classification Based on Improved Harmony Search

An essential goal of this study is to effectively distinguish between healthy individuals and patients with MCI. As mentioned, during the feature selection phase we select a subset of features that performs well across all base learners. Nevertheless, certain learners continue to perform suboptimally, and merely stacking multiple base learners increases algorithmic complexity and computational demands. In essence, learner selection, like feature selection, is a combinatorial optimization problem focused on enhancing classification performance. Thus, this study proposes using HSI to optimize the stacking of base learners. Unlike the HSI-based feature selection algorithm, it leverages the accuracy of the current base learners and their quantity to guide hyperparameter adjustments. This strategy aims to mitigate the adverse effects of underperforming learners on the overall model while enhancing model efficiency and minimizing computational costs. We selected KNN, decision tree (DT), random forest (RF), Gaussian naive Bayes, SVM, multilayer perceptron, LR, gradient boosting DT, and XGBoost as base learners, with LR serving as the meta-learner. The detailed algorithm is provided in .
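Assuming scikit-learn-style estimators, the stacking stage might look like the sketch below (default hyperparameters throughout; in DAELF-HSI the subset of base learners actually stacked is chosen by HSI rather than fixed as here).

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Candidate base learners named in the text; HSI would keep only a subset
base_learners = [
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier()),
    ("rf", RandomForestClassifier()),
    ("gnb", GaussianNB()),
    ("svm", SVC(probability=True)),
    ("mlp", MLPClassifier(max_iter=1000)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("gbdt", GradientBoostingClassifier()),
    ("xgb", XGBClassifier()),
]

# LR as the meta-learner, trained on out-of-fold base predictions (5-fold)
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000), cv=5)
# Usage: stack.fit(X_train, y_train); stack.predict(X_test)
```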
Statistical Analysis and Machine Learning Model

This study analyzed demographic characteristics, cognitive parameters, and physiological features using SPSS (version 22.0 for Windows; IBM). Initially, the Kolmogorov-Smirnov test was used to assess whether continuous variables such as age, years of education, hours of sleep, years of smart device usage, and cognitive data conformed to a normal distribution. Normally distributed variables were described as means (SD), and the t test was used for between-group comparisons to determine whether there were significant differences between the HC and MCI groups; the nonparametric Mann-Whitney U test was used for nonnormally distributed variables. Categorical variables such as gender and hand preference were described as counts (percentages), and the chi-square test was used for group comparisons. The significance level for all statistical analyses was set at P<.05. Furthermore, to validate the performance of the proposed detection framework, experiments were conducted using 5-fold cross-validation, with 4 evaluation metrics (accuracy, precision, recall, and F1-score) used to assess the classification outcomes. All learners used HSI for feature selection, and the average values were taken as the final classification results.
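The per-variable group comparisons can be reproduced with SciPy along the lines of this sketch; note that z scoring before the Kolmogorov-Smirnov test is an illustrative way to test against a standard normal and is not exactly equivalent to SPSS’s procedure.

```python
from scipy import stats

def compare_groups(hc, mci, alpha=0.05):
    """Normality-driven comparison of a continuous variable between groups."""
    normal = (stats.kstest(stats.zscore(hc), "norm").pvalue > alpha and
              stats.kstest(stats.zscore(mci), "norm").pvalue > alpha)
    if normal:
        return stats.ttest_ind(hc, mci)     # 2-tailed t test for normal data
    return stats.mannwhitneyu(hc, mci)      # nonparametric alternative

# Categorical variables (eg, gender): chi-square test on the contingency table,
# eg, chi2, p, dof, expected = stats.chi2_contingency(table)
```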
We conducted statistical analyses on the features extracted from EDA, HRV, and the cognitive tasks to identify the key features distinguishing healthy individuals from patients with MCI. As shown in Figure 3, red squares indicate significant differences (P<.05) between the 2 groups on specific features during certain cognitive tasks, with the depth of the color reflecting the degree of significance.
Figure 3. Results of the t tests performed on features extracted from electrodermal activity, heart rate variability, and cognitive tasks between groups. HRV: heart rate variability; SCL: skin conductance level; SCR: skin conductance response.

Based on observations from Figure 3A, relatively few SCL features distinguish between patients with MCI and healthy individuals: the SD, median, root mean square, and SD of spectral power. Conversely, Figure 3B reveals more significant differences in SCR features, including mean, energy, amplitude, rise time, delay time, and width in the time domain and mean in the frequency domain. These findings highlight several key points: (1) the SD and root mean square of SCL indicate variability and instability in physiological responses, with differences between patients with MCI and healthy individuals reflecting disparate levels of physiological variability; (2) statistical differences in the SD of SCL spectral power between the groups indicate that patients with MCI exhibit significantly different physiological responses within specific frequency ranges, possibly related to the impaired cognition associated with MCI; (3) statistical differences in SCR time domain features between groups indicate that patients with MCI exhibit variations in the intensity and timing of physiological responses, suggesting impaired regulatory capability of their nervous systems; and (4) variations in the mean values in the SCR frequency domain indicate that patients with MCI have different physiological response frequency distributions when processing stimuli compared with healthy individuals.
Figure 3C depicts the statistical analysis of the HRV features extracted from patients with MCI and healthy controls. The analysis indicates statistically significant distinctions in HRV indices such as SDNN (SD of N-N intervals), RMSSD (root mean square of successive differences), PNN50 (percentage of successive R-R intervals differing by >50 ms), the LF/HF (low-frequency to high-frequency) ratio, SD2/SD1 (the ratio of SD2 to SD1 of the Poincaré plot), and SampEn (sample entropy). These results highlight several key aspects: (1) SDNN, RMSSD, and PNN50, which quantify overall and short-term heart rate variation, reveal disparities in autonomic nervous system functioning between the groups; (2) the LF/HF ratio reflects imbalances between sympathetic and parasympathetic activity, indicating autonomic dysregulation in patients with MCI relative to controls; and (3) SD2/SD1 and SampEn capture the balance between long-term and short-term variability as well as the complexity and irregularity of HRV, illustrating differences in autonomic nervous system adaptability and complexity between groups. Finally, observations from Figure 3D reveal that (1) multiple cognitive tests using time and score as indicators can significantly distinguish between healthy individuals and patients with MCI, and (2) features such as handwriting, frequency, and curvature show varying degrees of significant difference across the different drawing tasks.
Performance of the DAELF-HSI in Mild Cognitive Impairment Detection

Table 2 lists the classification outcomes of the DAELF-HSI compared with 6 machine learning models, as well as the effect of applying HSI feature selection to these models. Following a thorough assessment involving 100 iterations of 5-fold cross-validation, the DAELF-HSI demonstrated an average accuracy of 88.5%, significantly surpassing the other algorithms. It also exhibited superior precision, recall, and F1-score, achieving 89.1%, 88.7%, and 88.9%, respectively. Without HSI-based feature selection, the SVM model outperformed the other machine learning algorithms with an accuracy of 79.6%. After integrating HSI feature selection, however, models such as KNN-HSI and multilayer perceptron-HSI surpassed SVM-HSI. Notably, all machine learning models improved consistently once HSI feature selection was introduced, with accuracy gains of 3%-5%, bringing every model above 81% accuracy. This highlights the efficacy of the HSI-based feature selection algorithm in identifying crucial features that enhance the predictive capabilities of the machine learning models under investigation.
Table 2. Performance comparison of Dynamic Adaptive Ensemble Learning Framework based on an Improved Harmony Search with 6 machine learning models and applying Improved Harmony Search to model feature selection.

Methods | Accuracy, mean (SD) | Precision, mean (SD) | Recall, mean (SD) | F1-score, mean (SD)
KNNa | 0.792 (0.018) | 0.775 (0.023) | 0.806 (0.029) | 0.790 (0.019)
SVMb | 0.796 (0.024) | 0.788 (0.025) | 0.800 (0.020) | 0.794 (0.027)
GNBc | 0.771 (0.022) | 0.787 (0.019) | 0.796 (0.022) | 0.791 (0.025)
DTd | 0.786 (0.025) | 0.791 (0.033) | 0.783 (0.025) | 0.787 (0.033)
MLPe | 0.790 (0.017) | 0.807 (0.025) | 0.794 (0.026) | 0.800 (0.019)
LRf | 0.784 (0.020) | 0.789 (0.034) | 0.791 (0.028) | 0.790 (0.038)
KNN-HSIg | 0.842 (0.017) | 0.848 (0.018) | 0.830 (0.027) | 0.839 (0.024)
SVM-HSI | 0.831 (0.022) | 0.833 (0.023) | 0.839 (0.030) | 0.825 (0.027)
GNB-HSI | 0.815 (0.023) | 0.815 (0.030) | 0.800 (0.025) | 0.807 (0.027)
DT-HSI | 0.827 (0.019) | 0.833 (0.022) | 0.809 (0.028) | 0.821 (0.031)
MLP-HSI | 0.836 (0.022) | 0.839 (0.024) | 0.842 (0.025) | 0.840 (0.018)
LR-HSI | 0.817 (0.025) | 0.815 (0.027) | 0.823 (0.031) | 0.819 (0.035)
DAELF-HSIh (ours) | 0.885 (0.020) | 0.891 (0.021) | 0.887 (0.024) | 0.889 (0.025)

aKNN: k-nearest neighbors.
bSVM: support vector machines.
cGNB: Gaussian naive Bayes.
dDT: decision tree.
eMLP: multilayer perceptron.
fLR: logistic regression.
gHSI: Improved Harmony Search.
hDAELF-HSI: Dynamic Adaptive Ensemble Learning Framework based on an Improved Harmony Search.
Similarly, Table 3 presents the classification effectiveness of DAELF-HSI compared with 5 ensemble learning models, augmented by HSI feature selection. In line with the findings in Table 2, DAELF-HSI maintains its superior performance. Noteworthy is that the ensemble learning techniques outperform the individual machine learning models discussed earlier, with XGBoost achieving a commendable average accuracy of 81.9%. The bagging model that uses KNN as its base learner improves on the stand-alone KNN model. However, the bagging model built on SVM falls short of expectations, slightly underperforming the stand-alone SVM model, possibly because SVM is a relatively stable learner and therefore gains little from bagging, which mainly benefits unstable models.
Table 3. Performance comparison of Dynamic Adaptive Ensemble Learning Framework based on an Improved Harmony Search with 5 ensemble learning models and applying HSI to model feature selection.

Methods | Accuracy, mean (SD) | Precision, mean (SD) | Recall, mean (SD) | F1-score, mean (SD)
Bag (KNNa) | 0.794 (0.022) | 0.791 (0.031) | 0.794 (0.026) | 0.792 (0.027)
Bag (SVMb) | 0.779 (0.011) | 0.788 (0.021) | 0.780 (0.017) | 0.784 (0.021)
RFc | 0.804 (0.015) | 0.802 (0.037) | 0.816 (0.027) | 0.809 (0.025)
GBDTd | 0.813 (0.019) | 0.816 (0.028) | 0.813 (0.027) | 0.814 (0.019)
XGBoost | 0.819 (0.015) | 0.825 (0.024) | 0.810 (0.026) | 0.817 (0.024)
Bag (KNN)-HSIe | 0.833 (0.024) | 0.842 (0.029) | 0.821 (0.034) | 0.831 (0.036)
Bag (SVM)-HSI | 0.811 (0.016) | 0.812 (0.024) | 0.819 (0.027) | 0.815 (0.024)
RF-HSI | 0.854 (0.017) | 0.858 (0.018) | 0.847 (0.021) | 0.852 (0.024)
GBDT-HSI | 0.848 (0.027) | 0.839 (0.025) | 0.852 (0.023) | 0.845 (0.016)
XGBoost-HSI | 0.854 (0.022) | 0.860 (0.020) | 0.849 (0.028) | 0.854 (0.029)
DAELF-HSIf (ours) | 0.885 (0.020) | 0.891 (0.021) | 0.887 (0.024) | 0.889 (0.025)

aKNN: k-nearest neighbors.
bSVM: support vector machines.
cRF: random forest.
dGBDT: gradient boosting decision tree.
eHSI: Improved Harmony Search.
fDAELF-HSI: Dynamic Adaptive Ensemble Learning Framework based on an Improved Harmony Search.
Figure 4 illustrates the box plots of the 12 algorithms after 100 independent experiments across the 4 evaluation metrics. The quartiles within each box plot depict algorithmic performance, while the red numbers above each box plot represent the mean values for the respective metrics. DAELF-HSI outstrips the competing models across all metrics, demonstrating superior stability and an absence of significant outliers. Conversely, SVM-HSI and GNB-HSI exhibit more outliers, indicating difficulty achieving precise fits for specific data distributions and leading to notable performance fluctuations. Although LR-HSI and Bag (KNN)-HSI display no outliers, their wide IQRs suggest instability across diverse feature distributions. Notably, RF-HSI shows significant outliers in the F1-score, highlighting the model’s vulnerability to certain data distributions or feature sets.
Figure 4. Comparison of evaluation metrics between Dynamic Adaptive Ensemble Learning Framework based on an Improved Harmony Search and various machine learning models applying harmony search improved feature selection. DT: decision tree; GBDT-HSI: gradient boosting decision tree-harmony search improved; GNB: Gaussian naïve Bayes; LR: logistic regression; KNN: k-nearest neighbors; RF-HSI: random forest-harmony search improved; SVM: support vector machines.

Analysis of the Optimization Module of the DAELF-HSI

Figure 5 presents the frequency analysis of the physiological features (including SCL, SCR, and HRV) and digital cognitive parameters over 100 iterations of the DAELF-HSI model. Notably, SCR decay time (SCR feature set), PNN50 and LF/HF (HRV feature set), and time (cognitive feature set) stand out for their importance in distinguishing between the 2 groups, appearing in over 80% of the selections. However, certain SCR features (eg, SCR amplitude, SCR width, and mean band) and HRV features (eg, SDNN, RMSSD, and mean HR), despite showing statistically significant group differences (P<.05; Figure 3), were infrequently chosen. This infrequency suggests that these parameters are highly correlated with previously selected features, leading the HSI feature selection algorithm to deem them redundant. Overall, the proposed HSI feature selection optimization algorithm can identify critical features, address redundancy, and accurately detect patients with MCI using a limited set of features.
Figure 5. The number of times each skin conductance level, skin conductance response, heart rate variability, and cognitive feature was selected for mild cognitive impairment detection.

In addition, Figure 6 illustrates the frequency distribution of the various base learners across 100 iterations within the DAELF-HSI model. Specifically, RF-HSI and XGBoost-HSI are the predominant selections, highlighting their substantial contributions to model efficacy and demonstrating the model’s proficiency in selecting optimal learners. Conversely, GNB-HSI, Bag (SVM)-HSI, and GBDT-HSI are chosen less frequently. This selection bias can be attributed primarily to 2 factors: first, these base learners inherently exhibit lower accuracy, leading the adaptive hyperparameter strategy of the DAELF-HSI to assign them reduced weights; second, a high level of correlation exists between these learners and other high-performing ensemble members, making their inclusion less impactful owing to redundant error distributions and similarities with more effective alternatives.
Figure 6. The number of times each base learner was selected for mild cognitive impairment detection. DT: decision tree; GBDT-HSI: gradient boosting decision tree-harmony search improved; GNB: Gaussian naïve Bayes; LR: logistic regression; KNN: k-nearest neighbors; RF-HSI: random forest-harmony search improved; SVM: support vector machines.

Finally, we substantiate the importance of the feature selection and stacking optimization stages within the DAELF-HSI through ablation experiments. As delineated in Table 4, 3 experimental models were structured to assess the impact of omitting 1 or both optimization stages, where “✓” denotes the inclusion of that stage. The models are configured as follows:
Model A uses 11 machine learning models as base learners and LR as the meta-learner, incorporating all features.
Model B uses 11 machine learning models as base learners with LR as the meta-learner but includes only the features selected through the HSI algorithm.
Model C integrates the HSI for stacking with 11 machine learning models, incorporating all features.

Table 4. Performance comparison of different ablation modules applied to the DAELF-HSIa.

Methods | HSIb features selection | HSI learners stacking | Accuracy, mean (SD) | Precision, mean (SD) | Recall, mean (SD) | F1-score, mean (SD)
Model A

aDAELF-HSI: Dynamic Adaptive Ensemble Learning Framework based on an Improved Harmony Search.
bHSI: Improved Harmony Search.
According to the results in Table 4, Model A shows lackluster classification performance in the absence of both optimization stages, primarily because of the poor performance of certain base learners and the inclusion of numerous redundant features during training. Model B, which incorporates HSI-selected features, demonstrates a slight performance improvement of around 1%, suggesting that merely reducing feature redundancy is insufficient to significantly enhance the output of underperforming base learners. Model C, which incorporates only the base learner stacking optimization stage, improves performance by nearly 3%, underscoring the importance of high-quality base learners in stacking algorithms. Notably, implementing both optimization stages simultaneously substantially enhances the model’s performance, highlighting the essential contribution of each stage to the overall efficacy of the algorithm.
Performance Evaluation of DAELF-HSI Across Different Datasets

As shown in Table 5, we also used 4 metrics (accuracy, sensitivity, specificity, and area under the curve [AUC]) to evaluate the binary classification performance of the model for patients with MCI versus healthy individuals in the internal and external test datasets. Specifically, the model demonstrated excellent performance in the development dataset, with accuracy, sensitivity, and specificity of 88.4%, 86.1%, and 90.9%, respectively. In the internal test dataset, accuracy decreased slightly to 85.5%, sensitivity remained unchanged at 86.1%, and specificity dropped to 84.8%, suggesting that a small number of healthy individuals were misclassified as patients with MCI, although overall performance remained satisfactory. In the external test dataset, accuracy, sensitivity, and specificity decreased by 3.9%, 0.4%, and 8.1%, respectively, relative to the development dataset. Nevertheless, these results demonstrate the model’s effectiveness, especially when evaluated on new and diverse samples. Furthermore, the AUC, an indicator of the model’s validity, was 0.945 (95% CI 0.903-0.986), 0.912 (95% CI 0.859-0.965), and 0.904 (95% CI 0.846-0.962) across the 3 datasets, as illustrated in Figure 7. With AUC values above 0.9 on all datasets, the DAELF-HSI model demonstrates excellent sensitivity and specificity in detecting patients with MCI. In other words, the model can not only effectively identify patients with MCI but also reduce the probability of misjudgment, enhancing its reliability in clinical applications.
Table 5. The diagnostic value of the classification model in differentiating between healthy older adults and patients with mild cognitive impairment was assessed using internal and external testing datasets.

Dataset | Accuracy | Sensitivity | Specificity | Area under the curve (95% CI) | P value
Development dataset | 0.884 | 0.861 | 0.909 | 0.945 (0.903-0.986) | <.001
Internal testing dataset | 0.855 | 0.861 | 0.848 | 0.912 (0.859-0.965) | <.001
External testing dataset | 0.845 | 0.857 | 0.828 | 0.904 (0.846-0.962) | <.001

Figure 7. The receiver operating characteristic curve of our model across different datasets.

Clinical Utility Analysis

To assess the practical clinical value of the proposed model, we performed decision curve analysis (DCA) on the development, internal testing, and external testing sets, plotting the model’s decision curve alongside the “Treat all” and “Treat none” strategies, as shown in Figure 8. Specifically, the model’s decision curve demonstrated clear clinical benefit in the development set, with its net benefit consistently exceeding that of the “Treat all” and “Treat none” strategies. Notably, over the 0 to 0.85 threshold range, the model’s net benefit declined from 1 to 0.75 and decreased further to 0.3 at higher thresholds, indicating that the model effectively avoids overtreatment in low-risk patient groups. In the internal testing set, the decision curve remained stable, with a gradual decline in net benefit that aligned with expectations: as the threshold increased, the model opted for fewer treatment decisions, validating its effectiveness in high-risk patient populations and its alignment with the clinical need to reduce unnecessary treatment. In the external testing set, however, the decision curve varied slightly compared with the other sets. The net benefit decreased from 1 to 0.2 over the 0-0.8 threshold range, indicating reduced efficacy in this range; over the 0.8-0.9 range, the net benefit rose from 0 to 0.4, suggesting that appropriately adjusting the threshold enhanced the model’s predictive accuracy; and over the 0.9-1 range, the net benefit dropped again to 0.2, which may indicate overly cautious predictions in the high-risk zone, limiting clinical utility there. In conclusion, the model demonstrated a net benefit consistently higher than both the “Treat all” and “Treat none” strategies across the 3 sets, particularly in the mid-to-low threshold ranges, highlighting that the model can effectively guide clinical decision-making, reduce unnecessary treatment, and improve the efficiency of early disease detection and intervention.
Figure 8. Decision curve analysis on different datasets, showing the model’s decision curves for the binary classification task, along with the “Treat all” and “Treat none” strategies. (A) Development set, (B) internal testing set, and (C) external testing set.
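For reference, the net benefit underlying a decision curve can be computed as in the sketch below; this is the standard DCA definition, not code from the study.

```python
import numpy as np

def net_benefit(y_true, y_prob, thresholds):
    """Standard decision curve analysis net benefit per threshold pt (pt < 1):
    NB(pt) = TP/n - FP/n * pt / (1 - pt)."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    n = len(y_true)
    nb = []
    for pt in thresholds:
        pred = y_prob >= pt                    # classify positive above threshold
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        nb.append(tp / n - fp / n * pt / (1 - pt))
    return np.array(nb)

# "Treat all": NB = prevalence - (1 - prevalence) * pt / (1 - pt); "Treat none": 0
```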