Voice analyses using smartphone-based data in patients with bipolar disorder, unaffected relatives and healthy control individuals, and during different affective states

Study design and participants

The present study included data from two studies: the RADMIS trial (Faurholt-Jepsen et al. 2020) and the larger, ongoing Bipolar Illness Onset study (BIO study) (Kessing 2017). Data were collected during the period from 2017 to 2020. All participants underwent the Schedules for Clinical Assessment in Neuropsychiatry (SCAN) interview (Wing et al. 1990) to confirm the presence or absence of a clinical diagnosis of BD.

The RADMIS trial

Patients with a diagnosis of BD who were hospitalized due to an affective episode and discharged from one of five psychiatric centers at the Mental Health Services, Capital Region of Denmark, in the period from May 2017 to August 2019 were invited to participate in the RADMIS trial. Inclusion criteria: age above 18 years, a BD diagnosis (ICD-10), and discharge from a psychiatric hospital in the Capital Region of Denmark following an affective episode (depression, mania, or mixed episode). Exclusion criteria: pregnancy and a lack of Danish language skills. In addition to standard treatment, patients were randomized with a balanced allocation ratio to either (1) daily use of a smartphone-based monitoring system (the Monsenso system; see description below) (the intervention group) or (2) normal use of their smartphones (the control group) during a 6-month follow-up period. Only patients from the intervention group who provided smartphone-based data were included in the present study.

The BIO study

Three groups of participants were included in the BIO study: patients with newly diagnosed BD, UR, and HC.

Patients with BD

Inclusion criteria: a new diagnosis of a single manic episode or BD (ICD-10) and age between 15 and 70 years.

UR

UR (siblings or children) of the patients included in the BIO study were recruited after permission from the patients. Exclusion criteria: any previous or current psychiatric diagnosis lower than F34.0 (ICD-10) (i.e., organic mental disorders, mental and behavioral disorders due to psychoactive substance use including alcohol, schizophrenia or other psychotic disorders, and affective disorders).

HC

HC were recruited among blood donors, aged between 15 and 70 years, from the Blood Bank at Rigshospitalet, Copenhagen. Exclusion criteria: a psychiatric disorder requiring treatment in the individual or in one of the individual's first-degree family members. All participants in the BIO study were offered the use of a smartphone-based monitoring system on a daily basis (the Monsenso system; see description below) during the study period.

Clinical assessments

Clinical evaluations of the severity of depressive and manic symptoms were conducted by a trained researcher using the 17-item Hamilton Depression Rating Scale (HDRS) (Hamilton 1967) and the Young Mania Rating Scale (YMRS) (Young et al. 1978).

Patient-reported smartphone-based data

A smartphone-based monitoring system (the Monsenso system) was installed on the participants' own smartphones (both iPhone and Android smartphones). The smartphone-based monitoring system, developed by the authors, was used by the patients with BD on a daily basis to collect fine-grained real-time recordings of mood, activity, and sleep duration (Bardram et al. 2013). Mood was evaluated with scores on a 9-point scale ranging from depressed to manic (− 3, − 2, − 1, − 0.5, 0, 0.5, 1, 2, 3). Euthymic mood was defined a priori as a mood score of − 0.5, 0, or 0.5. Depression was defined as a mood score < − 0.5, and mania was defined as a mood score > 0.5. Daily activity levels were rated on a 7-point scale (− 3, − 2, − 1, 0, 1, 2, 3), with 0 representing a normal activity level. Sleep duration was calculated based on daily reports of bedtime and wake-up time. Insomnia was defined as a total sleep duration < 360 min. In addition, a broader definition of mania was constructed by combining increased mood (> 0.5), increased activity (> 0), and decreased sleep (< 360 min). A minimal sketch of how these a priori definitions translate into daily class labels is given below.
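
The following sketch illustrates how one day's self-reports could be mapped to the class labels defined above. The column names and the example values are illustrative assumptions; the thresholds follow the definitions in the text, but the exact labelling code used in the study is not reproduced here.

```python
import pandas as pd

def label_day(row):
    """Map one day's self-reports to the a priori class labels (thresholds as defined above)."""
    labels = {}
    # Mood: euthymia is -0.5, 0 or 0.5; depression < -0.5; mania > 0.5
    if row["mood"] < -0.5:
        labels["mood"] = "depression"
    elif row["mood"] > 0.5:
        labels["mood"] = "mania"
    else:
        labels["mood"] = "euthymia"
    # Insomnia: total sleep duration below 360 minutes
    labels["insomnia"] = row["sleep_minutes"] < 360
    # Broader mania definition: increased mood, increased activity and decreased sleep combined
    labels["mania_broad"] = (
        row["mood"] > 0.5 and row["activity"] > 0 and row["sleep_minutes"] < 360
    )
    return labels

# Hypothetical example day
day = pd.Series({"mood": 1.0, "activity": 2, "sleep_minutes": 300})
print(label_day(day))  # {'mood': 'mania', 'insomnia': True, 'mania_broad': True}
```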

Voice features

Voice features were collected from the participants' phone calls (Android smartphones only) during their everyday life using the open-source Speech and Music Interpretation by Large-space Extraction (openSMILE v. 2.1.0, Emo-Large) toolkit (Eyben et al. 2010; Schuller et al. 2010). The toolkit is a feature extractor for signal processing and machine learning applications, and it is designed for real-time processing. The toolkit used built-in voice activity detection to record voice samples live from each incoming and outgoing phone call on the participants' smartphones. The voice activity detection was run solely on the study participant's onboard microphone, so that the voice segments represented a single recorded audio stream of the participant's own voice. The audio stream was used to extract acoustic features 'online', i.e., directly on the study participants' smartphones for each phone call. Voice samples were deleted locally on the smartphone after each phone call, and thus there was no access to any content-related material from phone calls. The Emo-Large set is a predefined set of 6552 features, e.g., pitch, loudness, and energy, represented through various first-level descriptive statistics including means, regression coefficients, and percentiles. The set has been found to be particularly relevant for classifying emotions (Pfister and Robinson 2010).
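
For orientation, the sketch below shows how the Emo-Large feature set can be extracted offline with the openSMILE command-line tool. In the study the extraction ran online on the smartphones, so the binary path, config path, file names, and output format below are illustrative assumptions only.

```python
import subprocess

# Assumed paths; adjust to the local openSMILE 2.1.0 installation.
SMILEXTRACT = "SMILExtract"             # openSMILE command-line binary
CONFIG = "config/emo_large.conf"        # predefined Emo-Large set (6552 features)
WAV_IN = "call_segment.wav"             # a recorded voice segment (hypothetical file)
FEAT_OUT = "call_segment_features.arff" # output format as configured by emo_large.conf

# One vector of segment-level functionals is appended to the output file per input segment.
subprocess.run(
    [SMILEXTRACT, "-C", CONFIG, "-I", WAV_IN, "-O", FEAT_OUT],
    check=True,
)
```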

Statistical methods

Data were imported to and processed in Python (version 3.8) with packages sklearn (v. 0.23.2), imblearn (v. 0.7.0), and pandas (v. 1.1.4).

Aim 1 concerned the discrimination between patients with BD, UR, and HC based on the collected voice features. Aims 2 and 3 concerned the use of voice data from patients with BD to classify the symptom class labels within mood, energy, and sleep collected daily from smartphones, and a combination of the three.

For all analyses, Random Forest (RF) classifiers were built to discriminate between classes (Breiman 2001). An RF classifier combines several decision tree classifiers into a single classifier, using an ensemble technique to yield a prediction from multiple independent decision trees. RF models were chosen as they can generally handle a large number of features while being robust to overfitting. Each tree is generated from a subsample of the data and uses a random subset of features to ensure a maximal degree of independence among the trees. The classifier uses supervised learning, i.e., information on group status/affective state, to build nodes that split the dataset into groups. These splits continue until the model either reaches a node containing only a single class or until further splits are unable to improve the classification. Call entries with missing voice feature values and features with identical values across all entries (i.e., zero variance) were removed.
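
A minimal sketch of this setup with scikit-learn is shown below. Hyperparameters such as the number of trees are not reported in the text, so the values used here are illustrative defaults, and the feature matrix is an assumed pandas DataFrame of call-level voice features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop call entries with missing voice features and drop zero-variance features."""
    df = df.dropna()                              # remove entries with missing values
    selector = VarianceThreshold(threshold=0.0)   # remove features with identical values
    kept = selector.fit(df).get_support()
    return df.loc[:, kept]

# Random Forest: an ensemble of decision trees, each built on a bootstrap sample of the
# data with a random feature subset at each split (n_estimators is an assumed value).
clf = RandomForestClassifier(n_estimators=500, random_state=0)
```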

All classifications were binary (e.g., patients with BD versus HC). For aims 2 and 3, patient-reported smartphone-based data for any specific day during the study period were included in the analyses if both voice features and patient-reported smartphone-based data on mood, activity, or sleep were available for the same day. We evaluated RF models on the resulting data set through a five-fold participant-based cross-validation. Five-fold cross-validation partitions the data into 5 parts of approximately the same size. Four of the five partitions were used to train the model, while the remaining partition was used to test the model, thereby evaluating the performance on unseen data samples. This was repeated 5 times so that all samples were used for testing once, yielding an average performance across all folds. We used a participant-based cross-validation version, where the test partition included participants that were not part of the training partition and vice versa. The participant-based method is particularly important for aim 1 since all voice data for each participant are identically labeled (i.e., either BD, UR, or HC). If the same participant were represented in both the training and test partitions, the model would falsely learn to discriminate participant-specific characteristics instead of clinical diagnosis or state. Ad hoc analyses without the participant-based cross-validation displayed significantly better results. Therefore, to avoid learning participant traits, all analyses included participant-based cross-validation.
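
A sketch of the participant-based five-fold split is given below, assuming numpy arrays of voice features X, binary class labels y, and one participant identifier per call. scikit-learn's GroupKFold keeps all calls from the same participant in either the training or the test partition, never both; the use of GroupKFold and the hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def participant_based_cv(X, y, participant_ids, n_splits=5):
    """Five-fold cross-validation where no participant appears in both train and test."""
    aucs = []
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(X, y, groups=participant_ids):
        clf = RandomForestClassifier(n_estimators=500, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    # Average performance across all folds, as reported in the Results
    return np.mean(aucs), np.std(aucs)
```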

In each cross-validation fold, the training set was used to calculate standardization parameters that transform the voice features in the training set to zero mean and unit variance. The calculated parameters were then applied to the test set. We used this standardization approach to create an unbiased data transformation invariant to factors such as gender, age, or the microphone types selected by the phone vendors. As we used a participant-based cross-validation approach, the standardization was done for each voice feature across all participants.
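
A sketch of the per-fold standardization: the scaler is fitted on the training partition only and then applied to the held-out test partition, so that no information from the test set leaks into the transformation.

```python
from sklearn.preprocessing import StandardScaler

def standardize_fold(X_train, X_test):
    """Fit zero-mean, unit-variance scaling on the training fold; apply it to both partitions."""
    scaler = StandardScaler().fit(X_train)   # parameters computed from training data only
    return scaler.transform(X_train), scaler.transform(X_test)
```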

Analyses concerning aims 2 and 3 were separated into two model types. First, a user-independent model that, as for aim 1, combines data from all participants in the same model. This model uses information from known participants to classify symptoms of unknown patients. Second, a user-dependent model, i.e., a personalized model built for each patient.
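
The distinction can be sketched as follows; the per-patient cross-validation details are simplified, and the hyperparameters are again assumed values.

```python
from sklearn.ensemble import RandomForestClassifier

def fit_user_independent(X, y):
    """User-independent model: one classifier trained on pooled data from all patients."""
    return RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

def fit_user_dependent(X, y, patient_ids):
    """User-dependent model: one personalized classifier per patient, trained on that patient's data only."""
    return {
        pid: RandomForestClassifier(n_estimators=500, random_state=0).fit(
            X[patient_ids == pid], y[patient_ids == pid]
        )
        for pid in set(patient_ids)
    }
```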

We observed substantial class imbalance in the data for all aims (e.g., fewer cases of 'mania' symptoms compared with 'euthymia'). Therefore, we applied a resampling process on the training data to balance the two classes. We used SMOTE oversampling (Chawla et al. 2002) of the minority class until it represented 33% of the cases, followed by random undersampling of the majority class until its sample size was identical to that of the minority class. The combination of SMOTE oversampling and undersampling has previously been shown to be effective in countering class imbalance (García et al. 2016). Without a resampling scheme, the RF classifier would favor overrepresented classes. However, resampling was only performed on the training data, to keep the test set class distribution representative of the collected data. In cases where the class distribution was skewed by less than the 33% threshold, we only performed random undersampling.
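
A sketch of this resampling scheme with the imblearn package is shown below: SMOTE oversamples the minority class to roughly one third of all cases (a minority-to-majority ratio of 0.5), after which random undersampling shrinks the majority class to the same size as the minority class. It is applied to the training partition only.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def resample_training_data(X_train, y_train):
    """SMOTE the minority class to ~33% of cases, then undersample the majority class to match."""
    # sampling_strategy=0.5: minority becomes half the size of the majority (~33% of all cases)
    X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X_train, y_train)
    # sampling_strategy=1.0: majority is randomly reduced to the size of the minority class
    return RandomUnderSampler(sampling_strategy=1.0, random_state=0).fit_resample(X_res, y_res)
```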

Classifier performance

We applied several standard metrics for binary classification, computed on held-out test data, and compared the results to a majority vote baseline model.

The metrics included (a) 'accuracy' (defined as the number of correct classifications of the positive and negative cases divided by the total number of cases); (b) 'F1-score' (which estimates the model's ability to identify the positive class correctly, and was defined as the true positives divided by the sum of the true positives and the average of the false positives and false negatives); (c) 'sensitivity' (defined as true positives divided by all positives); (d) 'specificity' (defined as true negatives divided by all negatives); and (e) 'area under the curve' (AUC), which is the area under the entire Receiver Operating Characteristic (ROC) curve. A ROC curve displays the model's sensitivity and specificity at all probability thresholds. The sensitivity and specificity reported in the tables are based on a threshold of 50%. An AUC value of 0.5 represents random guessing, while a value of one represents a perfect classifier. To further strengthen the interpretation of performance, a Bayesian inference framework with intrinsic priors was added (B10) (Leon-Novelo et al. 2012). The method handles unbalanced data well, as shown through various simulated and real-world examples (Olivetti et al. 2015). The measure is based on a test of statistical independence between, in this case, the predicted results and the actual registered symptoms, and a direct standardized interpretation guideline therefore exists. A value below 0 indicates negative evidence for statistical dependency, a value between 1 and 3 suggests positive evidence, a value between 3 and 5 strong evidence, and a value above 5 decisive evidence of statistical dependence.
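
The standard metrics can be computed per fold with scikit-learn as sketched below; the Bayesian evidence measure (B10) is not part of scikit-learn and is therefore omitted from this sketch.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

def fold_metrics(y_true, y_pred, y_score):
    """Standard binary-classification metrics for one cross-validation fold."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),         # true positives over all positives
        "specificity": tn / (tn + fp),         # true negatives over all negatives
        "auc": roc_auc_score(y_true, y_score), # area under the ROC curve
    }
```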

All classification metrics were computed within each cross-validation fold to yield a mean (M) and standard deviation (SD) across all five folds. In the personalized model, we further averaged across all patients.

For aim 1, we ran a randomized permutation test (Berry et al. 2002) to test whether voice data from the three populations were statistically significantly different from each other. We randomly shuffled the class label for each participant and re-ran the entire RF classification. This was repeated 200 times to generate a non-parametric null distribution of AUC scores (Fig. 1). Statistical significance was determined if the RF test AUC statistic obtained with the true class labels exceeded the null distribution at a significance level of p = 0.05.
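
A sketch of this permutation test: class labels are shuffled at the participant level, the full participant-based cross-validation is re-run (using the participant_based_cv sketch above), and the observed AUC is compared with the resulting null distribution. The function signature and seed are illustrative assumptions.

```python
import numpy as np

def permutation_p_value(X, y, participant_ids, observed_auc, n_permutations=200, seed=0):
    """Non-parametric p-value: compare the observed AUC with a shuffled-label null distribution."""
    rng = np.random.default_rng(seed)
    participants = np.unique(participant_ids)
    # One label per participant (all calls from a participant share the same label)
    labels = np.array([y[participant_ids == p][0] for p in participants])
    null_aucs = []
    for _ in range(n_permutations):
        shuffled = dict(zip(participants, rng.permutation(labels)))
        y_perm = np.array([shuffled[p] for p in participant_ids])
        auc_perm, _ = participant_based_cv(X, y_perm, participant_ids)  # sketch defined above
        null_aucs.append(auc_perm)
    # Proportion of permuted AUCs at least as large as the observed AUC
    return float(np.mean(np.array(null_aucs) >= observed_auc))
```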

Fig. 1

A generated null distribution of AUC values from a permutation test where the class labels (e.g., patients with bipolar disorder and healthy controls) are randomly shuffled 200 times and an AUC value for each permutation is plotted. The light grey region represents the critical area with the 5% largest values. The vertical lines represent the observed AUC values from the true class labels. A Generated null-distribution for the Random Forest classification of patients with bipolar disorder against healthy control individuals. B Generated null-distribution for the Random Forest classification of patients with bipolar disorder against unaffected relatives

For aims 2 and 3, we developed a majority vote model and a random classifier as baselines. Unlike the RF model, the baseline models did not include voice data. In the majority vote model, the most frequently observed class label in the training data was used to classify all test data. In cases of an equal class distribution, the test data were classified at random. The random classifier used a uniform distribution to randomly choose a class label.
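
Both baselines ignore the input features entirely and can be sketched with scikit-learn's DummyClassifier; note that this is an illustrative stand-in, and its tie-breaking for equal class distributions may differ slightly from the procedure described above.

```python
from sklearn.dummy import DummyClassifier

# Majority vote: always predicts the most frequent class label seen in the training data.
majority_baseline = DummyClassifier(strategy="most_frequent")

# Random classifier: draws a class label uniformly at random for each test sample.
random_baseline = DummyClassifier(strategy="uniform", random_state=0)
```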

Ethical considerations

The RADMIS trial

The RADMIS trial was approved by the Regional Ethics Committee in The Capital Region of Denmark and the data agency, Capital Region of Copenhagen (H-16046093, RHP-2017-005, I-Suite: 05365) and registered at ClinicalTrials.gov (NCT03033420).

The BIO study

The study protocol was approved by the Committee on Health Research Ethics of the Capital Region of Denmark (protocol No. H-7-2014-007) and the Danish Data Protection Agency, Capital Region of Copenhagen (RHP-2015-023).

Both studies complied with the Declaration of Helsinki (Seoul, October 2008). All participants provided written informed consent. Data from smartphones were stored by Monsenso subject to a data management agreement between Monsenso and The Capital Region of Denmark.
