Exploring Digital Biomarkers of Illness Activity in Mood Episodes: Hypotheses Generating and Model Development Study


Introduction

Mood disorders, including bipolar disorder (BD) and major depressive disorder (MDD), are ranked among the top 25 leading causes of disease burden worldwide [] and are associated with recurrent depressive and manic episodes. Manic episodes are characterized by increased activity and self-esteem, reduced need for sleep, and expansive mood and behavior, whereas during depressive episodes, patients experience decreased energy and activity, sadness, low self-esteem, and social withdrawal [-]. These changes in mood, sleep, and activity during mood episodes translate to changes in physiological data that novel research-grade wearables can capture with high precision in real time [,]. Linking these digital signals with illness activity could potentially identify digital biomarkers [].

Biomarkers are characteristics that are measured as an indicator of pathogenic processes (disease-associated biomarkers) or responses to an exposure or intervention (drug-related biomarkers) []. These can include molecular, histological, radiographic, or physiological characteristics. Digital biomarkers are objective, quantifiable physiological and behavioral measures collected using digital devices that are portable, wearable, implantable, or digestible []. Traditional biomarkers can be invasive and expensive to measure and are difficult to collect over time, thus giving an incomplete view of the complexity and dynamism of the disease. Alternatively, digital biomarkers are usually noninvasive, modular, and cheaper to measure, and they provide access to continuous and longitudinal measurements, both qualitative and quantitative. Moreover, they offer novel ways of measuring health status by providing perspectives into diseases that were unavailable before, which can supplement and enhance conclusions from traditional biomarkers []. Digital biomarkers have the potential to redefine diagnosis, improve the accuracy of diagnostic methods, enhance monitoring, and personalize interventions [], leading to precision medicine, especially in psychiatric diseases [].

In the last decade, there has been an exponential growth in the number of digital biomarker studies in the health domain, especially in cardiovascular and respiratory diseases []. Wearables are the most common type of digital devices used in digital biomarker studies, especially those incorporating accelerometer sensors that measure physical activity []. Wearable devices include wristbands, smartwatches, smart shirts, smart rings, smart electrodes, smart headsets, smart glasses, and so on. Wrist-worn devices are the most common type of wearable device in mental health studies and have been shown to be effective in diagnosing anxiety and depression; however, none of these studies used them for treatment. The most commonly used category of data for model development was physical activity data, followed by sleep and heart rate (HR) data []. There are several areas in health care in which wearable devices have shown potential, including monitoring, diagnosis, treatment, and rehabilitation of diseases. Even though wearables have shown accurate activity-tracking measurements and are acceptable to users [], including feasibility studies in people with mental health problems [], their implementation in usual clinical practice remains challenging [].

Wearables collecting actigraphy, the noninvasive method of monitoring human rest and activity [], can capture altered sleep rhythms in remitted BD [] and also depressive symptoms []. In addition, actigraphy data from wearables have been shown to accurately predict mood disorder diagnoses and symptom change []. Moreover, wearables collecting blood pulse have shown differences in HR variability (HRV) between BD and healthy controls (HCs) [], as well as between affective states in BD []. In addition, people with bipolar and unipolar depression and suicidal behavior have long shown autonomic alterations that can be captured as hyporeactive electrodermal activity (EDA) [,], and in recent years, research-grade wearables have incorporated sensors allowing continuous EDA collection []. With these upgrades, it has recently become feasible to monitor mood changes in patients with MDD [] and also to predict the presence and severity of depressive states in BD and MDD with promising accuracy using wearable physiological data []. Despite these promising results, the specific roles of these digital signals and their longitudinal potential to measure illness activity and treatment response in mood disorders are still unknown.

The conjunction of advances in machine learning [] and the improved precision of wearable devices [] may help identify physiological patterns of illness activity in mood disorders. Considering this promising background, we first explored whether physiological wearable data could predict the severity of an acute affective episode at the intra-individual level (aim 1) and the polarity of an acute affective episode and euthymia among different individuals (aim 2). Secondarily, we explored which physiological data were related to the prior predictions, generalization across patients, and associations between affective symptoms and physiological data.


Methods

Study Design

This was a prospective exploratory observational study with 3 independent groups (): group A, patients with acute affective episodes, manic episodes in BD (n=2), major depressive episodes in BD (n=2) and MDD (n=2), and mixed features manic episodes in BD (n=2); group B, euthymic patients with BD (n=2) and MDD (n=2); and group C, HCs (n=7). Potential participants were identified at the outpatient and the acute inpatient or hospitalization at home units by their clinicians (ie, psychiatrists). Physiological data were recorded across 3 consecutive time points for group A: T0-acute (T0): current acute affective episodes according to the Diagnostic and Statistical Manual of Mental Disorders–5 (DSM-5); T1-response (T1): symptom response, defined as more than 30% improvement in the Young Mania Rating Scale (YMRS) score or the 17-item Hamilton Depression Rating Scale (HDRS) score; and T2-remission (T2): symptomatic remission, with YMRS and HDRS scores ≤7 []. Euthymic patients (group B) and HCs (group C) were recorded during a single session.

The inclusion criteria were as follows: (1) aged above 18 years; (2) having a diagnosis according to the DSM-5 [] criteria confirmed with the Structured Clinical Interview for DSM-5 Disorders []; and (3) willingness and ability to give consent (reconfirmed upon clinical remission). In addition, euthymic patients (group B) also had to (4) score ≤7 on the YMRS and HDRS for at least 8 weeks []. HCs (group C) had to present no current or previous psychiatric disorder according to the DSM-5 criteria, confirmed using the Structured Clinical Interview for DSM-5 Disorders, excluding nicotine substance use disorder. Exclusion criteria for all groups were as follows: (1) concomitant severe cardiovascular or neurological medical conditions with a potential autonomic dysfunction, ongoing cardiovascular arrhythmia, or pacemaker; (2) comorbid current substance use disorder according to the DSM-5 criteria, excluding nicotine substance use disorder; (3) comorbid current psychiatric disorder with great interference of symptoms (eg, obsessive compulsive disorder with ritualized behaviors); (4) current pharmacological treatment with β-blockers or other pharmacological treatments affecting the autonomic nervous system; and (5) ongoing pregnancy.

Figure 1. Study design and recordings. BD: bipolar disorder; HC: healthy controls; HDRS: Hamilton Depression Rating Scale; MDD: major depressive disorder; SCID: Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders; T0: current acute Diagnostic and Statistical Manual of Mental Disorders–5 affective episodes; T1: symptom response; T2: symptomatic remission; YMRS: Young Mania Rating Scale.

Assessments

The following sociodemographic variables were collected: age, sex, DSM-5 psychiatric diagnoses [], medical and psychiatric comorbidities, years of illness duration, first-degree relative with mental illness, and drug misuse habits. Psychopathological assessments were conducted using the YMRS [,] for manic symptoms and the 17-item HDRS [,] for depressive symptoms. Clinical assessments were performed during a single session for euthymic patients (group B) and HCs (group C) and at 3 consecutive time points (T0-acute, T1-response, and T2-remission) for patients with acute affective episodes (group A), as described in .

Research-Grade Wearable Device for Recording

When choosing a wearable device for a research project, there are several factors that should be considered, including (1) the signals of interest to be captured (eg, stress-related and actigraphy); (2) the users who will be studied (eg, inpatients, outpatients, and HCs); (3) the pragmatic needs of the study (eg, budget, battery life, placement of the devices, and confidentiality of participants); (4) establishing assessment procedures (eg, stress elicitation task, resting, and sleep); and (5) performing qualitative and quantitative analyses on the resulting data (eg, visually inspecting the data registered, quantifying data loss, assessing the quality of data, and comparing the data of different wearable devices) []. Considering the previous points, the E4 wristband from Empatica [] was the preferred wearable device for the purpose of our study for several reasons. First, the E4 has shown accurate measurement of HR, HRV [], and EDA compared with laboratory conditions [], as well as accurate sleep staging []. As previously mentioned, these physiological parameters have been shown to be altered in mood disorders and mood episodes [-,-]. Second, the E4 has been validated in scientific research for detecting emotional arousal, stress [,], and mental effort [] using the aforementioned physiological signals. Furthermore, the E4 has proven to be useful in predicting depressive symptoms in MDD with low relative errors [,], predicting self-reported depressive states [], and identifying and quantifying the severity of anxiety states []. In patients with BD, the E4 has been shown to be useful in distinguishing manic from euthymic mood states [,]. Third, the inpatients included in the study were in a highly restricted setting, which would not allow the use of user-dependent wearables or devices providing external communication (eg, an internet connection). This requirement was fulfilled by the E4 device.
Finally, the data recorded by the E4 are of high precision and quality [,], with minimal data loss when performing the analyses (see the Results section).

Recording Procedure of Physiological Data

For each recording, patients and HCs were provided with an E4 wristband [] () for approximately 48 hours (limited by battery life). The research team collected the wearables after each session. Individuals’ behavior was not externally influenced in any manner, beyond the requirement of wearing the wristband. Patients with acute affective episodes (group A), during their psychiatric admission in the inpatient unit, were not allowed to leave the hospital at any point until discharge, as is the standard practice with inpatients. T0-acute, T1-response, and T2-remission recordings were usually carried out in this setting. This was not the case with patients at the hospitalization at home or outpatient units (a minority of all cases), who were not subject to mobility restrictions. In all cases, both patients and HCs were asked to wear the wristband during their daily life, with little to no interference in their behavior. They were also asked to put on the wristband themselves at the beginning of the recording while researchers checked for adequate contact between the sensors and the skin of the wrist. Participants were instructed to remove the device when taking a shower to preserve the integrity of the device.

The E4 wristband has sensors that collect physiological data at different sampling rates. The physiological data signals from each recording session were collected from the following channels and sampling rates as raw data: 3D acceleration (ACC) in space over time on an x-, y-, and z-axis (ACC, 32 Hz); EDA (4 Hz); skin temperature (TEMP, 4 Hz); and blood volume pulse (BVP, 64 Hz); or in a processed format: interbeat intervals (IBIs, the time between 2 consecutive heart ventricular contractions) and HR (1 Hz). The BVP signal is obtained using a photoplethysmography sensor that measures volume changes in the blood. Empatica uses 2 algorithms on the BVP signal to construct an IBI with which HR (and HRV) can be calculated. The 2 algorithms are optimized to detect heartbeats and discard beats that contain artifacts [,].

Preprocessing of Physiological Data

Owing to the naturalistic setting of the recording sessions, the data obtained from the E4 wristband are inherently noisy. For instance, some patients show low levels of compliance during an affective episode (eg, mania), which can lead to poor skin contact from the device, and hence inaccurate readings for certain channels, or to complete removal of the wearable device, resulting in unusable data. To that end, we removed invalid physiological data by enforcing the rules-based filter by Kleckner et al [], plus an additional rule removing HR values outside the physiologically plausible range (25-250 bpm), to quality control the raw data and discard physiologically impossible recordings (). Quality controlling physiological data from wearable devices is common practice: this type of data is particularly noisy, failing to quality control it favors spurious correlations, and previous works have advised against imputing data in this scenario [].

We did not use IBI data because of the disproportionately high number of missing values (approximately 70%) relative to data from other channels [], particularly as IBI is merely a derivation of BVP. Consequently, we did not calculate HRV features. In sum, a total of 7 channels from the E4 device (ACC_X, ACC_Y, ACC_Z, BVP, EDA, HR, and TEMP) were used as physiological data to build the prediction models. Different time units (µ) and window lengths (w) were explored during tuning, and the best combination was selected. Because the sampling rate varied across channels, the recordings were time aligned: if a channel’s sampling rate was higher than 1 Hz, that channel was downsampled by taking the average value across samples within µ. We compared different sampling rates (1, 2, 4, 32, and 64 Hz) and used 1 Hz because it showed the best performance; therefore, a time unit µ=1 second was set across all channels. Upon time alignment, each recording was then segmented into a predefined number of segments using a tunable window length (w), taking values in real-time seconds (only powers of 2, specifically from 2^0 [1 s] to 2^11 [2048 s], were explored for computational convenience). Of note, when tuning the hyperparameter w, an interesting pattern appeared across tasks, whereby a value of 2^5 (ie, 32 s) emerged as an optimal point, whereas smaller or larger values were associated with a deterioration in validation performance (U-shaped performance); therefore, µ=1 second and w=2^5 (32 s) were used for analyses as the best-performing configuration ().
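
To make the alignment and segmentation steps concrete, the following minimal sketch (illustrative only; the function names and dictionary layout are ours, not the study codebase) averages channels sampled above 1 Hz into 1-second bins, truncates all channels to a common length, and cuts the result into nonoverlapping 32-second windows:

```python
import numpy as np

def to_1hz(signal, fs):
    """Average consecutive samples so a channel ends up at 1 Hz (µ = 1 s)."""
    n = len(signal) // fs                     # keep whole seconds only
    return np.asarray(signal, float)[:n * fs].reshape(n, fs).mean(axis=1)

def segment(channels, w=32):
    """channels: dict name -> (samples, sampling rate in Hz).
    Returns an array of shape (n_segments, w, n_channels)."""
    aligned = [to_1hz(sig, fs) if fs > 1 else np.asarray(sig, float)
               for sig, fs in channels.values()]
    n = min(map(len, aligned))                # truncate to shortest channel
    stacked = np.stack([a[:n] for a in aligned], axis=-1)
    return stacked[:(n // w) * w].reshape(n // w, w, stacked.shape[-1])
```

With µ=1 second and w=32, a full 48-hour recording (172,800 s) would yield roughly 5400 such windows.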

To obtain an equal number of segments from each class for model evaluation, we randomly selected 20 segments from each session and stored them as a held-out test set, which was never observed by the model during either training or validation. We then randomly assigned the remaining segments to the train and validation sets with ratios of 80% and 20%, respectively. Each segment was normalized (scaled to [0, 1]) using the per-channel global (across all segments) minimum and maximum values derived from the train set.
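
The held-out split and per-channel scaling described above might look like the following simplified sketch over a single pool of segments (the study held out 20 segments from each session; the 80/20 ratios and the [0, 1] min-max scaling follow the text, whereas the function name and seed are ours):

```python
import numpy as np

def split_and_scale(segments, n_test=20, val_frac=0.2, seed=0):
    """segments: (n, w, c). Hold out n_test segments as a test set, split the
    rest 80/20 into train/validation, and scale every set to [0, 1] using
    per-channel global min/max computed on the train set only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(segments))
    test, rest = segments[idx[:n_test]], segments[idx[n_test:]]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    lo = train.min(axis=(0, 1), keepdims=True)   # global per-channel minimum
    hi = train.max(axis=(0, 1), keepdims=True)
    scale = lambda x: (x - lo) / (hi - lo + 1e-12)
    return scale(train), scale(val), scale(test)
```

Deriving the scaling constants from the train set alone avoids leaking test-set statistics into training.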

Table 1. Rules-based filter for invalid physiological data.

Rule | Filter for invalid data | Range
1 | To prevent “floor” artifacts (eg, electrode loses contact with skin) and “ceiling” artifacts (circuit is overloaded): EDA^a not in a valid range | 0.05 to 60 µS^b
2 | EDA changes too quickly: EDA slope not in a valid range | −10 to +10 µS/second
3 | Skin temperature suggests the EDA sensor is not being worn: skin temperature not in a valid range | 30 to 40 °C
4^c | HR^d not in a valid range | 25 to 250 bpm^e
5 | Transitional data surrounding segments identified as invalid via the preceding rules: account for transition effects | Within 5 seconds

aEDA: electrodermal activity.

bµS: microsiemens.

cAddition to the algorithm used by Kleckner et al [].

dHR: heart rate.

ebpm: beats per minute.
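
The rules in Table 1 can be expressed as a simple vectorized filter. The sketch below is illustrative (the function name, and the assumption that HR has been resampled to the EDA length, are ours); the thresholds are those of Table 1:

```python
import numpy as np

def invalid_mask(eda, temp, hr, fs_eda=4, pad_s=5):
    """Flag samples violating the rules-based filter (Table 1).

    eda, temp: 4 Hz arrays (µS, °C); hr: array assumed resampled to the same
    length as eda (bpm). Returns a boolean mask (True = invalid)."""
    slope = np.gradient(eda) * fs_eda              # µS per second
    bad = (
        (eda < 0.05) | (eda > 60)                  # rule 1: EDA range
        | (np.abs(slope) > 10)                     # rule 2: EDA slope
        | (temp < 30) | (temp > 40)                # rule 3: sensor worn
        | (hr < 25) | (hr > 250)                   # rule 4: plausible HR
    )
    # rule 5: also drop transitional samples within 5 s of any violation
    pad = pad_s * fs_eda
    for i in np.flatnonzero(bad):
        bad[max(0, i - pad):i + pad + 1] = True
    return bad
```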

Data Analyses

Tasks

The recording segments produced with the preprocessing steps described earlier were used in supervised learning experiments as input to the supervised models. For aim 1, models were trained on 3-class classification tasks (T0-acute, T1-response, and T2-remission) for each individual with an acute affective episode (manic BD, depressed BD, depressed MDD, and mixed BD). For aim 2, one model was trained on a 7-class classification task (manic BD, depressed BD, mixed BD, depressed MDD, euthymic BD, euthymic MDD, and HCs).

Segments from each class under a given task were extracted in equal numbers to obtain perfectly balanced classes. As the sets were designed to be perfectly balanced, we adopted accuracy as our primary metric but also reported the F1-score, precision, and recall and computed the area under the receiver operating characteristic (AUROC) curves. It should be noted that ours is a multiclass setting, but as we had perfectly balanced sets, the micro-, macro-, and weighted averages coincided. For the AUROC curves, the one-vs-rest multiclass strategy (also known as one-vs-all) was adopted, which amounts to computing a receiver operating characteristic (ROC) curve for each class, so that at a given step, a given class is regarded as positive and the remaining classes are lumped together as a single negative class.
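
For reference, the per-class one-vs-rest AUROC described above can be reproduced with scikit-learn along these lines (an illustrative sketch, not the study code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def ovr_auroc(y_true, y_prob, classes):
    """One AUROC per class: class k is positive, all others are lumped
    together as a single negative class."""
    y_bin = label_binarize(y_true, classes=classes)   # (n, n_classes)
    return {c: roc_auc_score(y_bin[:, k], y_prob[:, k])
            for k, c in enumerate(classes)}
```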

As part of our exploratory data analysis, to quantify the association between physiological data and affective symptoms measured by the YMRS and HDRS scale items, their normalized mutual information (NMI) was computed.
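
One plausible way to compute such channel-item NMI values, mirroring the per-item scaling used in the tile plots, is sketched below; the choice of scikit-learn's mutual_info_regression estimator is our assumption, as the exact estimator is not specified here:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def item_channel_nmi(X, item_scores, seed=0):
    """X: (n_samples, n_channels) channel summaries; item_scores: (n_samples,)
    one YMRS/HDRS item. Returns per-channel MI scaled so the maximum across
    channels is 1 for that item (as in the tile plots)."""
    mi = mutual_info_regression(X, item_scores, random_state=seed)
    top = mi.max()
    return mi / top if top > 0 else mi
```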

For each task, except the one distinguishing members of a group consisting only of HCs, we were interested in testing the degree to which a model can generalize to different individuals, unseen during training, who share the same psychiatric label (diagnosis and psychopathological status). We therefore prepared a test set of segments from recordings collected from an independent group of individuals and tested the model on this extra, independent holdout set to obtain an estimate of the out-of-sample generalization performance.

Model

We selected a Bidirectional Long Short-Term Memory (BiLSTM) model [] as our model architecture. BiLSTM is a type of recurrent neural network (RNN), a class of deep learning model specifically designed to handle sequence data such as time series. RNNs process streams of data one time step at a time, and they store information regarding previous time steps in a hidden unit, such that the model output at each time step is informed by the current time step as well as by previous ones. Long short-term memory (LSTM) units represent an improvement over vanilla RNNs, as they address gradient instability by modeling the hidden state with cells that decide what to keep in memory and what to discard. This feature makes LSTM more efficient in capturing long-range dependencies. In contrast to a simple LSTM, BiLSTM reads the input sequence in 2 directions, from start to end and from end to start, thereby allowing for a richer representation. Although other deep learning architectures suitable for time series have been developed (more recently, the transformer []), as the aim of this work was exploratory rather than benchmarking different models, we contented ourselves with a single popular architectural choice for time series. By the same token, we used a simple shallow BiLSTM with 128 hidden units and tanh activation, followed by a single dense layer with softmax activation, to output the possible classes. The BiLSTM model was trained using the Adam optimizer [] for 120 epochs with a learning rate of 0.001 and a batch size of 32 to minimize the cross-entropy between the ground-truth distribution over classes and the probability distribution of belonging to such classes output by the last network layer. To reduce overfitting, dropout [] and early stopping were used. The choice of hyperparameters was based on a random search that yielded the best performance in the validation set.
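
The architecture and training setup described above can be sketched in Keras as follows (illustrative; the dropout rate of 0.2 and the early-stopping patience are our assumptions, as the text does not specify them):

```python
import tensorflow as tf

def build_model(w=32, n_channels=7, n_classes=3):
    """Shallow BiLSTM (128 tanh units) with a softmax head, trained with Adam
    (learning rate 0.001) on the cross-entropy loss, as described above."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(w, n_channels)),
        tf.keras.layers.Bidirectional(
            # dropout rate 0.2 is an assumption; the text only states that
            # dropout was used
            tf.keras.layers.LSTM(128, activation="tanh", dropout=0.2)),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# early stopping on validation loss was used to curb overfitting; the
# patience value here is an assumption
early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=120, batch_size=32, callbacks=[early_stop])
```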

Permutation Feature Importance

To assess the channels’ individual impact on the test set performance in the aforementioned tasks, we adopted a perturbation-based approach. For each channel at a time, we randomly permuted its values in the test set segments and computed the difference in performance relative to the baseline model. We chose this approach because it has a straightforward interpretation and provides a highly compressed, global insight into the importance of the channels. Agreement on channels’ relevance across different tasks was measured using the Kendall W.
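
A minimal sketch of this perturbation-based procedure (illustrative; names are ours) permutes one channel at a time across test segments and reports the resulting drop in accuracy:

```python
import numpy as np

def permutation_importance(base_acc, X_test, y_test, predict, seed=0):
    """Change in accuracy after scrambling one channel at a time.

    X_test: (n_segments, w, n_channels); predict: callable returning class
    labels; base_acc: accuracy of the model on the intact test set."""
    rng = np.random.default_rng(seed)
    importances = {}
    for ch in range(X_test.shape[-1]):
        X_perm = X_test.copy()
        perm = rng.permutation(len(X_perm))
        # scramble channel ch across segments, breaking its link to the labels
        X_perm[..., ch] = X_perm[perm, :, ch]
        importances[ch] = base_acc - (predict(X_perm) == y_test).mean()
    return importances
```

A positive value means the model relied on that channel; a value near 0 means scrambling it left the predictions unchanged.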

Code and Data Availability

The codebase was written in Python (version 3.8; Python Software Foundation), where the deep learning models were implemented in TensorFlow and developed on a single NVIDIA RTX 2080Ti. The repository for this study can be found on the internet [].

Ethics Approval and Confidentiality

This study was conducted in accordance with the ethical principles of the Declaration of Helsinki and Good Clinical Practice and was approved by the Hospital Clinic Ethics and Research Board (HCB/2021/104). All participants provided written informed consent before their inclusion in the study. All data were collected anonymously and stored encrypted in servers complying with all General Data Protection Regulation and Health Insurance Portability and Accountability Act regulations.


Results

Overview

A total of 35 sessions from 12 patients (manic, depressed, mixed, and euthymic) and 7 HCs (mean age 39.7, SD 12.6 years; 6/19, 32% female) were analyzed, totaling 1512 hours recorded. The median percentage of data per recording session dropped from further analysis after quality control was 11.05% (range 2.50%-34.21%). A clinical demographic overview of the study sample is presented in .

Table 2. Clinical demographic overview of the study sample.

Diagnosis | Age (years) | Sex | HDRS^a score (T0^c/T1^d/T2^e) | YMRS^b score (T0/T1/T2)
Manic BD^f | 40 | Male | 5/4/4 | 24/8/2
Manic BD^g | 21 | Male | 3/5/4 | 23/15/1
Depressed BD^h | 33 | Male | 23/6/4 | 0/0/0
Depressed BD^g,h | 36 | Male | 17/12/3 | 2/4/2
Mixed BD | 30 | Female | 8/4/4 | 30/20/5
Mixed BD^g | 40 | Male | 11/2/1 | 29/10/3
Depressed MDD^i | 57 | Male | 33/13/7 | 7/2/0
Depressed MDD^g | 45 | Male | 27/11/7 | 4/1/1
Euthymic BD | 54 | Male | 3/—^j/— | 0/—/—
Euthymic BD^g | 61 | Male | 1/—/— | 3/—/—
Euthymic MDD | 60 | Female | 4/—/— | 0/—/—
Euthymic MDD^g | 60 | Male | 3/—/— | 0/—/—
HC^k | 32 | Female | 0/—/— | 0/—/—
HC^g | 34 | Male | 0/—/— | 0/—/—
HC | 28 | Female | 0/—/— | 1/—/—
HC | 29 | Male | 0/—/— | 2/—/—
HC | 31 | Male | 2/—/— | 1/—/—
HC | 32 | Female | 1/—/— | 3/—/—
HC | 31 | Female | 0/—/— | 1/—/—

aHDRS: Hamilton Depression Rating Scale.

bYMRS: Young Mania Rating Scale.

cT0: current acute Diagnostic and Statistical Manual of Mental Disorders–5 affective episodes, or the single recording for euthymic patients and healthy controls.

dT1: symptom response.

eT2: symptomatic remission.

fBD: bipolar disorder.

gThe recording segments extracted from the marked subjects were used to check the models’ ability to generalize to clinically similar subjects, unseen during training.

hAll registers performed at the hospitalization at home or outpatient units.

iMDD: major depressive disorder.

jEuthymic patients and healthy controls were recorded during a single session (T0).

kHC: healthy control.

Aim 1: Prediction of the Severity of an Acute Affective Episode at the Intra-individual Level

The 3-class classification tasks (T0-acute, T1-response, and T2-remission; accuracy expected by chance: 1/3=33%) to predict the severity of an acute affective episode showed accuracies ranging from 62% (depressed BD) to 85% (depressed MDD). The generalization models on unseen patients showed accuracies ranging from 28% (depressed MDD) to 57% (manic BD; ). The confusion matrix is shown in . This means that the model showed moderate to high accuracies for classifying the severity of each acute affective episode, with the best prediction models classifying individuals with depressed MDD and manic BD. However, generalization of the models showed very low accuracy (at chance; approximately 30%) for depressed MDD and mixed BD, low accuracy (slightly above chance; >40%) for depressed BD, and moderate accuracy (>55%) for manic BD.

The permutation importance analysis for the classification tasks for aims 1 and 2 is shown in . Kendall W was 0.383, indicating fair agreement in feature importance across both intra- and inter-individual classification tasks. ACC was the most relevant channel for predicting mania, whereas EDA and HR, followed by TEMP, were the most relevant channels for predicting both BD and unipolar depression (aim 1). Scrambling the BVP channel changed performance neither for the better nor for the worse ().

Table 3. Prediction of the severity of an acute affective episode: model and generalization on unseen patients.

Individuals with affective episodes | Performance metric | Model | Generalization
Manic BD^a | Accuracy^b (%) | 70 | 56.67
| F1-score | 0.6978 | 0.5279
| Precision | 0.6979 | 0.5381
| Recall | 0.7000 | 0.5667
| AUROC^c | 0.6980 | 0.5432
Depressed BD | Accuracy^b (%) | 61.67 | 41.67
| F1-score | 0.6171 | 0.3968
| Precision | 0.6273 | 0.4085
| Recall | 0.6167 | 0.4167
| AUROC | 0.6115 | 0.4067
Mixed BD | Accuracy^b (%) | 63.33 | 30
| F1-score | 0.6333 | 0.2576
| Precision | 0.6333 | 0.3004
| Recall | 0.6333 | 0.3068
| AUROC | 0.6333 | 0.3012
Depressed MDD^d | Accuracy^b (%) | 85 | 28.33
| F1-score | 0.8492 | 0.2451
| Precision | 0.8774 | 0.2581
| Recall | 0.8500 | 0.2833
| AUROC | 0.8672 | 0.2856

aBD: bipolar disorder.

bAccuracy expected by chance for a 3-class classification task is 1/3=33%. Thus, accuracies above 33% suggest that the model can predict outcomes better than random guessing, and higher values for accuracy indicate better predictive capacity of the model. Note that the test set was designed to have the same number of samples in each class. This is reflected in the values of F1-score, precision, and recall being very close to each other and to that of accuracy.

cAUROC: area under the receiver operating characteristic.

dMDD: major depressive disorder.

Figure 2. Permutation importance analysis. The height of the bars shows the change in accuracy at test time upon scrambling a channel through a random permutation of its values. A positive (negative) permutation importance value means that scrambling that channel results in a drop (increase) in accuracy relative to the baseline where original (nonpermuted) values were used across all channels, that is, the channel’s permutation deteriorates (improves) the performance. A “0” permutation importance value indicates that a random permutation of the channel’s values does not affect accuracy in either direction. For instance, electrodermal activity (EDA) shows a positive change in accuracy of 40% for the intra-individual depressed BD severity prediction model; this means that removing this channel from the model would result in a decrease of prediction accuracy of 40% (from 62% to 22%); thus, EDA is highly relevant for that model. Different colors correspond to the different tasks being investigated. ACC: acceleration; BD: bipolar disorder; BVP: blood volume pulse; HC: healthy controls; HR: heart rate; MDD: major depressive disorder; TEMP: temperature; T0: current acute Diagnostic and Statistical Manual of Mental Disorders–5 affective episodes; T1: symptom response; T2: symptomatic remission.

Aim 2: Prediction of the Polarity of an Acute Affective Episode and Euthymia Among Different Individuals

The 7-class classification task (accuracy expected by chance: 1/7=14%) to predict the polarity of affective episodes and euthymia showed an accuracy of 70%. The best classifications were depressed and euthymic MDD, followed by depressed BD, and the worst was manic BD, followed by HCs. The generalization model showed an accuracy of 15.7% (slightly above chance). The classification task for 7 HCs showed an accuracy of 50% (). The confusion matrix is shown in . Thus, both models showed predictions above chance, but their generalization was poor. Moreover, the model including patients with acute affective episodes obtained higher accuracy (70%) than the model including 7 HCs (50%). This increased prediction capacity suggests that psychopathological symptoms during acute affective episodes may translate into physiological alterations that are not present in HCs.

The most relevant channels for predicting the polarity of affective episodes, euthymia, and HCs among different individuals (aim 2) were EDA, followed by ACC, HR, and TEMP (all channels showed >30% permutation importance). The BVP channel permutation importance was approximately 0%. These results were highly similar for the classification task of 7 HCs, but EDA showed only 4.9% permutation importance ().

Table 4. Prediction of the polarity of an acute affective episode and euthymia among different individuals: model and generalization on unseen patients.

Individuals | Performance metric | Model | Generalization
6 patients (acute affective episodes and euthymia) and 1 HC^a | Accuracy^b (%) | 70 | 15.7
| F1-score | 0.6927 | 0.1516
| Precision | 0.6889 | 0.1513
| Recall | 0.6934 | 0.1517
| AUROC^c | 0.6900 | 0.1510
7 HCs | Accuracy^b (%) | 50 | —^d
| F1-score | 0.4923 | —
| Precision | 0.4911 | —
| Recall | 0.4988 | —
| AUROC | 0.4998 | —

aHC: healthy control.

bAccuracy expected by chance for a 7-class classification task is 1/7≈14%. Thus, accuracies above 14% suggest that the model can predict outcomes better than random guessing, and higher values for accuracy indicate better predictive capacity of the model. Note that the test set was designed to have the same number of samples in each class. This is reflected in the values of F1-score, precision, and recall being very close to each other and to that of accuracy.

cAUROC: area under the receiver operating characteristic.

dAs we were interested in predicting affective psychopathology, we tested the degree to which a model can generalize to different individuals for each task except for the one about distinguishing members of a group of only HCs.

Symptom Association With Physiological Data

The tile plots for the NMI between physiological data and the YMRS and HDRS scale items for the aforementioned intra-individual (aim 1) and between-individuals (aim 2) classification tasks are shown in and , respectively. TEMP had the highest association with the psychometric scales (NMI approximately 1.0), and BVP had the lowest consistency (NMI scores oscillating from 0 to 1).

Figure 3. Tile plots for the normalized mutual information analysis between physiological data and psychometric scales’ items: intra-individual level. For each scale item, the mutual information (MI) with respect to each of the channels was measured and scaled to 0 to 1 by dividing by the maximum MI value for that item. Values of 0 indicate no association; values of 1 indicate the maximum recorded MI across all channels for an individual item. ACC_X: x-axis acceleration; ACC_Y: y-axis acceleration; ACC_Z: z-axis acceleration; BD: bipolar disorder; BVP: blood volume pulse; EDA: electrodermal activity; HDRS: Hamilton Depression Rating Scale; HR: heart rate; MDD: major depressive disorder; TEMP: temperature; YMRS: Young Mania Rating Scale.

Figure 4. Tile plot for the normalized mutual information analysis between physiological data and psychometric scales’ items: between-individual level. For each scale item, the mutual information (MI) with respect to each of the channels was measured and scaled to 0 to 1 by dividing by the maximum MI value for that item. Values of 0 indicate no association; values of 1 indicate the maximum recorded MI across all channels for an individual item. ACC_X: x-axis acceleration; ACC_Y: y-axis acceleration; ACC_Z: z-axis acceleration; BVP: blood volume pulse; EDA: electrodermal activity; HC: healthy controls; HDRS: Hamilton Depression Rating Scale; HR: heart rate; TEMP: temperature; YMRS: Young Mania Rating Scale.

Intra-individual NMI Analysis

Motor activity (ACC) channels were highly associated with manic symptoms (NMI>0.6), and stress-related channels (EDA and HR) with depressive symptoms (NMI from 0.4 to 1.0), as shown in .

Between-Individuals NMI Analysis

“Increased motor activity” (YMRS item 2 [YMRS2]) was associated with ACC (NMI>0.55), “aggressive behavior” (YMRS9) with EDA (NMI=1.0), “insomnia” (HDRS4-6) with ACC (NMI∼0.6), “motor inhibition” (HDRS8) with ACC (NMI∼0.75), and “psychic anxiety” (HDRS10) with EDA (NMI=0.52), as shown in .
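The per-item scaling used in the NMI tile plots (each item's MI values divided by that item's maximum across channels) can be sketched in plain NumPy. This is a minimal illustration under stated assumptions: the histogram-based MI estimator, bin count, and the generic channel/item arrays are placeholders, not the study's actual implementation.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based plug-in estimate of mutual information (in nats)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def scaled_mi_tile(channels, items):
    """Rows = scale items, columns = channels; each row scaled by its max MI,
    so 1 marks the most associated channel for that item and 0 no association."""
    tile = np.array([[mutual_information(ch, it) for ch in channels]
                     for it in items])
    row_max = tile.max(axis=1, keepdims=True)
    return np.divide(tile, row_max, out=np.zeros_like(tile),
                     where=row_max > 0)
```

Dividing by the row maximum (rather than a global maximum) makes the tiles comparable within an item but not across items, which matches the caption's per-item scaling.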


Discussion

Principal Findings

Although other studies have used raw physiological data to predict mental health status, this is the first study to present a fully automated method for the analysis of raw physiological data from a research-grade wearable device, including a rules-based filter for invalid physiological data. All other studies presented methods that required manual interventions at some point in the pipeline [,,,], thus hindering the replicability and scalability of results. Moreover, our preprocessing pipeline is strictly based on the best-performing algorithm for the analysis (ie, not arbitrarily decided), whereas other studies chose arbitrary cutoff points for analyzing raw physiological data (eg, ACC data recorded at 32 Hz sampling rates analyzed arbitrarily in 1-min epochs []). Our method may allow other research teams to use a viable supervised learning pipeline for time-series analyses for a popular research-grade wristband [].

In addition, our work integrates physiological digital data from all sensors captured by a research-grade wearable, and we assessed the relevance of each channel (ACC, TEMP, BVP, HR, and EDA) in the prediction models. In contrast, other studies have focused on specific digital signals, such as actigraphy [], or used combinations of digital signals (such as actigraphy and EDA) and predesigned features (eg, amplitude of skin conductance response peaks) [] but arbitrarily disregarded other digital signals, such as TEMP, or derived features, such as HRV.

Furthermore, we aimed to distinguish the severity of mania and depression in a progressive and longitudinal manner according to the usual clinical resolution of mood episodes. We believe that the quantification of affective episodes is a harder but clinically more relevant task that may allow a more accurate and precise understanding of the disease than a mere dichotomous (acute vs remission) classification, as done in previous studies [,].
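A rules-based validity filter of the kind mentioned above can be illustrated as a simple plausibility check per channel. The ranges, channel names, and NaN-masking strategy below are hypothetical placeholders, not the thresholds used in the study.

```python
import numpy as np

# Hypothetical plausibility ranges per channel (illustrative only)
RULES = {
    "TEMP": (25.0, 40.0),   # degrees Celsius, wrist skin temperature
    "HR":   (25.0, 220.0),  # beats per minute
    "EDA":  (0.01, 100.0),  # microsiemens
}

def rules_filter(channel, values):
    """Replace samples outside the channel's plausible range with NaN,
    so downstream steps can ignore or impute them."""
    lo, hi = RULES[channel]
    values = np.asarray(values, dtype=float)
    return np.where((values >= lo) & (values <= hi), values, np.nan)
```

For example, `rules_filter("HR", [70, 0, 300, 80])` keeps the two physiologically plausible readings and masks the other two, requiring no manual inspection of the raw signal.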
In addition, we included in the same work analyses at the intra-individual level and between different individuals, analyses targeting specific mood symptoms, and assessments of the generalization of the models to unseen patients. We believe that the use of different analysis methods allows us to examine the data from complementary perspectives to answer specific research questions. In addition, these different approaches may reveal random associations or artifacts that would stay hidden without replication. On the basis of these exploratory results, we propose hypotheses for future testing [] in current and other similar projects.

Note that the (1) intra- and (2) inter-individual analyses address different research questions: the (1) intra-individual analytical approach looks at the course of an index episode within a single patient and examines whether different states (from the acute phase to response and remission) can be distinguished from each other; on the other hand, the (2) inter-individual analytical approach takes a cross-sectional view and studies the degree to which different mood disorder states (comprising the full spectrum from depression to mixed state, mania, and euthymia) can be separated. Both analyses try to identify digital biomarkers of illness activity using physiological data collected with a wristband. However, intra-individual analyses look for a fine-grained quantification of illness activity that may allow the identification of low-severity mood states (or prodromal phases) in comparison with moderate to severe ones. Conversely, inter-individual analyses could potentially distinguish between mood phases (mania vs depression) or cases from HCs but may not be suitable for assessing the severity of mood episodes, as represented in . Studies in similar areas, such as brain-computer interfaces for the rehabilitation of motor impairments [] or seizure forecasting [], emphasized the importance of the subject-wise approach (modeling each subject separately). In many instances, despite work on domain adaptation [] to learn subject-invariant representations, a model has to be fine-tuned at the level of the single patient.
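The between-individuals evaluation on unseen patients corresponds to a leave-one-subject-out scheme, which can be sketched as follows. The `subject_ids` vector (one entry per data window) is a hypothetical input, not a variable from the study's pipeline.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) pairs, holding out all windows
    of one subject at a time so the model never sees the test subject."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = np.flatnonzero(subject_ids == s)
        train = np.flatnonzero(subject_ids != s)
        yield train, test
```

Splitting by subject rather than by window is what prevents the model from exploiting subject-specific characteristics (age, sex, gait) as shortcuts when estimating between-individuals performance.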

Figure 5. Severity versus Mood-Phase Classification Models: visual grounds for both intra- and inter-individual analyses. On the left, a severity classification model for a patient with depression (acute-response-remission phases). On the right, a mood-phase classification model (depression, mania, and euthymia). Note that in the left model, the same individual is compared at 3 different states (corresponding to a reduction in depressive psychopathology). Thus, individual-level characteristics (age, sex, and gait) should show little to no variation and should remain the same across the 3 longitudinal recordings; therefore, the shift in the covariate distribution should be relatively contained and should not influence the classification of the model (which captures mood-relevant signals). In contrast, on the right, 3 different individuals at 3 different mood states are compared. In this case, the model would potentially distinguish between mood phases (mania vs depression), or cases from healthy controls, but may not be able to distinguish longitudinal changes in disease severity over the course of an index episode. In addition, in the latter model, subject-specific characteristics may overlap with mood-relevant signals, thus acting as confounders for the model. T0: current acute Diagnostic and Statistical Manual of Mental Disorders–5 affective episodes; T1: symptoms’ response; T2: symptomatic remission.

Studies comparing intra- and inter-individual models show that although intra-individual (within-subject or patient-specific) models are trained on the data of a single subject, they perform better than intersubject (cross-subject or generalized) models []. However, some studies have shown that hybrid models trained on multiple subjects and then fine-tuned on subject-specific data led to the best performance, without requiring as much data from a specific subject []. In intersubject studies, models generally see more data, as multiple subjects are included, but must contend with greater data variability, which introduces different challenges. In fact, there is both intra- and intersubject variability owing to time-variant factors related to the experimental setting and underlying psychological parameters. This impedes direct transferability or generalization among sessions and subjects []. To illustrate this, in a study aimed at evaluating a seizure detection model using physiological data and determining its application in a real-world setting, 2 procedures were applied: intra- and intersubject evaluation. Intrasubject evaluation focuses on the performance of the methodology when applied to data from a single patient, whereas intersubject evaluation assesses the performance across multiple patients with potentially different types of epilepsy and seizure manifestations [].
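The hybrid scheme described above (pretrain on pooled subjects, then fine-tune on subject-specific data) can be sketched with a plain-NumPy logistic regression, where warm-starting gradient descent from the pooled weights stands in for fine-tuning. All function names, hyperparameters, and data shapes are illustrative assumptions, not the models used in the cited studies.

```python
import numpy as np

def sgd_logreg(X, y, w=None, lr=0.1, epochs=200):
    """Logistic regression trained by full-batch gradient descent.
    Passing an existing weight vector w continues (fine-tunes) training
    instead of starting from scratch."""
    X = np.c_[X, np.ones(len(X))]          # append bias column
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        w = w - lr * X.T @ (p - y) / len(y)
    return w

# Hybrid scheme (hypothetical arrays):
#   w_pool = sgd_logreg(X_pooled, y_pooled)                  # all subjects
#   w_subj = sgd_logreg(X_subject, y_subject, w=w_pool,
#                       epochs=50)                           # fine-tune
```

The appeal of this design is that the pooled model supplies a subject-invariant starting point, so the per-subject fine-tuning step needs far fewer labeled windows from the target patient than training a patient-specific model from zero.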

Notably, the out-of-sample generalizations of the two models differ vastly. Whereas the intra-individual model requires multiple seizures recorded per subject and will produce individualized models tailored to a single patient, the inter-individual model requires seizures recorded from multiple participants and will provide intersubject models to be used over wider populations. For this purpose, intersubject variability plays a key role: focal seizures have a multitude of possible clinical manifestations that can occur in sequence or in parallel and can be repeated or not occur at all in a single seizure. For instance, preictal tachycardia appears to be a phenomenon that is not generalizable to patient cohorts. Furthermore, although there may often be little change in the semiology of seizures for a single patient, they can be very heterogeneous across populations. Intra-individual models optimized for each patient can robustly detect seizures in some patients with epilepsy, but they may fail, especially when the seizures have differing semiologies that are not represented in the training data for the model. Intersubject models perform worse than models trained in an individualized manner, at least in terms of either sensitivity or false-alarm rates []. This is analogous to a study aimed at evaluating a model for mood episode detection and determining its application in a real-world setting. During acute affective episodes, a huge combination of symptoms can be present in 2 d
