Harnessing Consumer Wearable Digital Biomarkers for Individualized Recognition of Postpartum Depression Using the All of Us Research Program Data Set: Cross-Sectional Study


Introduction

Background

Postpartum depression (PPD) is the most common complication of childbirth, occurring in approximately 1 in 7 women []. PPD can have several implications for women, manifesting in ways such as irritability, mood swings, fatigue, sleep and appetite disturbance, and thoughts of suicide []. Undetected PPD has also been shown to have financial implications for affected individuals as it can lead to challenges in maintaining employment or reduced work performance []. Furthermore, PPD has been linked to an elevated risk of mood disorders in the child as well as paternal depression [,].

Unfortunately, PPD remains significantly underdiagnosed and undertreated, as indicated by the strikingly low treatment rate of only 15% []. The current method of diagnosing PPD relies on screening instruments such as the Edinburgh Postnatal Depression Scale (EPDS), Center for Epidemiologic Studies Depression Scale, Patient Health Questionnaire, and Postpartum Depression Screening Scale, where the EPDS is the most commonly used instrument []. Often, women also need to undergo blood tests to assess thyroid function as the symptoms of PPD frequently overlap with hypothyroidism []. Due to the challenges in diagnosing PPD, traditional approaches using these screening tools contribute to inadequate screening of women and subsequent underdiagnosis [,]. Therefore, the advent of new technologies is greatly needed to enable adequate and, hopefully, earlier detection of PPD.

Digital health tools have been gaining traction in recent years due to the near-ubiquitous ownership of smartphones []. Leveraging data passively collected by wearables (ie, digital biomarkers such as the average heart rate [HR], total steps, and calories burned per day) coupled with machine learning (ML) algorithms provides an opportunity to model the relationship between digital biomarkers and a particular disease for early recognition.

Prior Work

Previous studies have demonstrated that ML algorithms using digital biomarkers from smartwatches can predict cardiovascular diseases, infection, diabetes, and mental health conditions [-]. For example, one study demonstrated that a wearable device could estimate changes in symptom severity in patients with major depressive disorder. Its findings indicated that ML models exclusively using digital biomarkers from wearables achieved moderate performance, with correlation coefficients of 0.56 (95% CI 0.39-0.73) and 0.54 (95% CI 0.49-0.59) in the time-split and user-split scenarios, respectively, between model predictions and actual Hamilton Depression Rating Scale scores []. Another study recruited individuals with moderate depression for 4 weeks to develop individualized ML models based on digital biomarkers to predict mood. Their findings displayed a correlation between digital biomarkers and depression, as evidenced by high-performing models with a mean absolute error of 0.77 (SD 0.27) points on the 7-point Likert scale, which corresponds to a mean absolute percent error of 27.9% (SD 10.3%) []. A study by Wang et al [] found that students with higher depressive symptoms measured using the 8-item Patient Health Questionnaire were more likely to (1) use their phone at study locations (correlation coefficient [r]=0.39; P<.001) compared to all-day phone use (r=0.28; P=.01), (2) have irregular sleep time (r=0.30; P=.02) and wake time (r=0.27; P=.04) schedules, (3) be stationary for more time (r=0.37; P=.01), and (4) visit fewer places during the day (r=−0.27; P=.02). In addition, students with higher depressive symptoms measured using the 4-item Patient Health Questionnaire (1) were around fewer conversations (P=.002), (2) slept for shorter durations (P=.02), (3) fell asleep later (P=.001), (4) woke up later (P=.03), and (5) visited fewer places (P=.003) over the previous 2-week period [].
Other studies examining the association between digital biomarkers from wearables and depression include those by (1) Moshe et al [], who demonstrated a negative association between the variability of locations visited and depressive symptoms (β=−.21; P=.04) and a positive association between total sleep time and time in bed and depressive symptoms (β=.24; P=.02); and (2) Rykov et al [], who showed that a larger variation in nighttime HR between 2 AM and 4 AM (r=0.26; P=.001) and between 4 AM and 6 AM (r=0.18; P=.04) and lower regularity of weekday circadian activity based on steps (r=−0.17; P=.049) were associated with higher severity of depressive symptoms.

Additional research has been conducted related to understanding the relationship between wearable-derived digital biomarkers and PPD. For instance, one study showed that the features most predictive of maternal loneliness, which is commonly associated with PPD, were activity intensity, activity distribution during the day, resting HR, and HR variability []. It was also shown that women with milder depression symptoms typically had a larger daily radius of travel compared to those with more severe symptoms (2.7 vs 1.9 miles; P=.04) []. Finally, women with depression have been shown to have a lower HR variability (measured using the SD of 24-hour NN intervals, F=6.4; P=.01, and the SD of the averages of NN intervals in 5-minute segments, F=6.04; P=.02) and elevated HR while sleeping (F=5.05; P=.03) compared to women without depression [].

While these studies highlight a relationship between digital biomarkers and depression or PPD, they suffer from the following limitations: (1) some studies use data in the model that need active patient engagement with partnered mobile apps, where user retention is known to decrease over time with health-related apps; (2) most studies do not use a predictive framework but rather examine the association between digital biomarkers and depressive symptoms; (3) only one study has developed individualized ML models; (4) most studies analyzing women with PPD have limited time frames and do not capture continuous longitudinal data across different phases of pregnancy; and (5) no studies have developed individualized ML models for women in the postpartum period combining data from wearables and the electronic health record (EHR) []. Therefore, a method that provides continuous and personalized monitoring without the need for clinical encounters to enable early detection of mental health disorders, including PPD, is needed.

Goal of This Study

The All of Us Research Program (AoURP) is a comprehensive data set that collects several types of health-related data, including surveys, EHRs, physical measurements, and wearable data from Fitbit devices, with an emphasis on patient populations that have been previously underrepresented in biomedical research []. Currently, the longitudinal Fitbit data from >15,000 AoURP participants are made available to registered researchers on the All of Us Researcher Workbench, providing an opportunity to explore digital biomarkers in a diverse cohort of participants.

It is unknown whether digital biomarkers from consumer wearables can be used to detect PPD. In this study, we combined several orthogonal approaches demonstrating that digital biomarkers can be used for individualized classification of PPD with data collected from Fitbit using the AoURP (). This work demonstrated that (1) the integration of data sources, including EHR and wearable data, proves valuable for PPD recognition; (2) using longitudinal and continuous wearable data across various pregnancy phases supports ML model development; and (3) combining these integrated data sources facilitates the creation of individualized ML models, which may outperform cohort-based models. As such, our findings uncovered a novel method for recognizing PPD and serve as a framework that can be leveraged to facilitate early PPD detection. Moreover, the significance of this research underscores the promise of individualized ML models for detecting PPD, which can be applied to other mental health disorders.

Figure 1. An overview of the analysis workflow to evaluate the potential for digital biomarkers in postpartum depression (PPD) recognition. (1) Develop and perform computational phenotyping of PPD and non-PPD cohorts; (2) merge with available digital biomarker data for each woman (heart rate, steps, physical activity, and calories burned); (3) classify each day as 1 of 4 periods (prepregnancy period, pregnancy, postpartum period without depression, or PPD); (4) build and assess individualized machine learning (ML) models testing random forest, generalized linear models, support vector machine, and k-nearest neighbor algorithms; (5) validate the ML approach in women without PPD; (6) compare individualized model performance in women with and without PPD; (7) determine variable importance for PPD recognition; (8) generate Shapley additive explanations dependence plots to assess the relationship between digital biomarkers and PPD; and (9) compare individualized ML models versus a cohort-based model for PPD detection. EHR: electronic health record.

Methods

Data Source and Platform

This study used the AoURP Registered Tier v6 data set. Study analysis was conducted using the AoURP Researcher Workbench cloud platform. All computational phenotyping, data processing, data analysis, and ML algorithms were conducted using R (R Foundation for Statistical Computing). Fitbit data collected in the AoURP adhere to a bring-your-own-device model, wherein participants who contribute their data are already in possession of a Fitbit device. The daily average HR, HR SD, minimum HR, quartile 1 HR, median HR, quartile 3 HR, and maximum HR were calculated using the Fitbit HR level table. The sum of steps was calculated using the Fitbit intraday steps table. Activity calories, calories burned during the basal metabolic rate (calories BMR), calories out, fairly active minutes, lightly active minutes, marginal calories, sedentary minutes, and very active minutes were taken from the Fitbit activity summary table. Day-level data were calculated for each of the 4 periods: prepregnancy period, pregnancy, postpartum period, and PPD (or PPD equivalent). All digital biomarkers included in this analysis are passively tracked by Fitbit; however, calories BMR is a calculated digital biomarker based on self-reported height, weight, age, and gender [].
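
As an illustration of the day-level HR summaries described above, the following is a minimal Python sketch (the study itself used R, and this is not the authors' code). The function name, the input format (a plain list of HR readings for one day), and the quantile method are assumptions made for the example.

```python
import statistics


def summarize_daily_hr(hr_values):
    """Return the seven day-level heart rate summaries for one day of readings.

    hr_values is a hypothetical list of HR readings (beats per minute)
    recorded on a single day; at least two readings are needed for the SD.
    """
    # "inclusive" matches the common default quantile definition (an assumption).
    q1, median, q3 = statistics.quantiles(hr_values, n=4, method="inclusive")
    return {
        "hr_mean": statistics.mean(hr_values),
        "hr_sd": statistics.stdev(hr_values),
        "hr_min": min(hr_values),
        "hr_q1": q1,
        "hr_median": median,
        "hr_q3": q3,
        "hr_max": max(hr_values),
    }


summary = summarize_daily_hr([60, 62, 64, 70, 80])
```

Applying the same function to every day of a participant's readings would yield one row of HR features per day.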

Ethical Considerations

The protocol for the human participant research conducted was reviewed by the institutional review board of the AoURP (protocol 2021-02-TN-001). The institutional review board follows the regulations and guidance of the National Institutes of Health Office for Human Research Protections for all studies, ensuring that the rights and welfare of research participants are overseen and protected uniformly. Participants who contribute data to the AoURP have gone through an informed consent process with the option to withdraw at any time. Privacy is maintained by (1) storing data on protected computers; (2) preventing researchers from seeing information that directly identifies participants, such as name or social security number; and (3) requiring researchers to sign a contract stating that they will not try to identify participants. Furthermore, the Researcher Workbench is accessible only to researchers who are affiliated with an institution that has a signed Data Use Agreement and who have completed the necessary training. If participants are asked (and decide) to go to an All of Us partner center for physical measurements and to give blood, saliva, or urine samples, they are offered a one-time compensation of $25 in the form of cash, a gift card, or an electronic voucher.

In compliance with the Data and Statistics Dissemination Policy of the AoURP, counts of <20 cannot be presented to mitigate the risk of patient reidentification []. As the cohort of patients with PPD presented in this analysis comprised <20 patients, percentages were presented as percentage ranges (eg, instead of presenting the data as 53%, they were presented as 50%-55%). Publication of results in this manner has been approved by the AoURP Resource Access Board. Furthermore, race and ethnicity were not reported due to the limited sample size as requested by the AoURP Resource Access Board.
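
The percentage-range reporting can be sketched in a few lines of Python. This is an illustrative example only; the function names, the 5-point bin width inferred from the 50%-55% example, and the handling of values that fall exactly on a bin boundary are assumptions.

```python
def to_percentage_range(pct, width=5):
    """Map an exact percentage to the (lower, upper) bounds of its bin.

    Assumption: a value exactly on a boundary (eg, 55) is placed in the
    bin that starts at that value (55-60).
    """
    lower = int(pct // width) * width
    return lower, lower + width


def format_range(pct, width=5):
    """Format the bin as it would be reported in text, eg, '50%-55%'."""
    lower, upper = to_percentage_range(pct, width)
    return f"{lower}%-{upper}%"


reported = format_range(53)
```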

Computational Phenotyping

Identifying Women With PPD

Women with PPD were identified using the following three-fold approach: (1) selecting women with a diagnosis of PPD using the condition data and identifying women with a record of (2) pregnancy or (3) delivery who had been diagnosed with depression or had antidepressant drug exposure during the postpartum period.

The first branch of the 3-fold approach to creating a cohort of women with PPD was conducted using Observational Medical Outcomes Partnership concept IDs in the condition table based on the Observational Health Data Sciences and Informatics initiative in [,]. For both the second and third branches of the method, we first identified women with a record of delivery (using condition data) or pregnancy (using the condition and survey tables) based on concept IDs from previously published work in . Next, the data were filtered on the earliest record of delivery or pregnancy to capture and analyze digital biomarker data during the prepregnancy period. To estimate the date of pregnancy or delivery (depending on which was available for that individual), the date observed in the EHR from the AoURP was adjusted by adding or subtracting 9 months, which is a typical pregnancy duration []. Our next step was to estimate the window of the postpartum period, which was defined as starting from the date of delivery and spanning 24 months after that date, to monitor depressive symptoms [,]. Consistent with other EHR computational phenotyping studies of PPD, individuals were also classified as being PPD positive if they had a diagnosis of depression in the condition table or antidepressant drug exposure within the postpartum window [] (). Specific concepts containing the terms episode, remission, reactive, atypical, premenstrual, schizoaffective, and seasonal were excluded when identifying individuals with a depression diagnosis as they would not appropriately capture women with a persistent depression during the postpartum period. If a woman in the PPD cohort showed records of depression diagnosis and antidepressant drug exposure, we selected the earliest record to be considered the index date. 
For women with pregnancy and delivery data available, the index date and data used were based on the delivery record as this provided a higher level of confidence in defining the postpartum period and, subsequently, whether the depression diagnosis or antidepressant drug exposure occurred during the postpartum period. Finally, the PPD cohort was generated by selecting unique women from each of the 3 branches of our approach.
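
The date logic in the second and third branches (a 9-month offset, a 24-month postpartum window, and the earliest qualifying record as the index date) can be sketched as follows. This is a hedged Python illustration, not the authors' R implementation; the function names, the month-clamping rule, and the inclusive window boundaries are assumptions.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def estimate_dates(pregnancy_date=None, delivery_date=None):
    """Return (pregnancy_start, delivery), preferring the delivery record."""
    if delivery_date is not None:
        return add_months(delivery_date, -9), delivery_date
    return pregnancy_date, add_months(pregnancy_date, 9)


def find_index_date(delivery, event_dates, window_months=24):
    """Earliest depression diagnosis or antidepressant exposure that falls
    within the postpartum window; None if there is no such record."""
    window_end = add_months(delivery, window_months)
    in_window = [d for d in event_dates if delivery <= d <= window_end]
    return min(in_window) if in_window else None


start, delivery = estimate_dates(pregnancy_date=date(2020, 1, 15))
index_date = find_index_date(
    delivery, [date(2020, 9, 1), date(2021, 1, 10), date(2020, 12, 1)]
)
```

In this example, the record dated before the estimated delivery is ignored, and the earlier of the two remaining records becomes the index date.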

Identifying Women Without PPD

Women without PPD were selected as a control group to validate our approach because they experienced the same periods as women in the PPD cohort with the exception of having diagnosed or inferred PPD (see the previous section). Therefore, our modeling approach could be tested in an identical fashion (see more details about ML models in the section titled Individualized ML Models for Women Without PPD). To establish a cohort of women without PPD, we applied an identical rationale to that of the second and third branches of our PPD phenotyping, as described previously. Subsequently, women with records indicating PPD or depression diagnosis during the postpartum period from the condition table or any instances of antidepressant drug use from the drug exposure table were excluded.

Data Preparation for Analysis and Individualized ML Models

To prepare the data for analysis and individualized ML models using wearable data, we first merged day-level data from Fitbit (HR, steps, physical activity, and calories burned; see Table S1 in [-] for more information on digital biomarkers) for each individual ranging from 2 years before to 30 days after the index date to capture their behavior before, during, and after pregnancy. Previous studies have demonstrated that HR, steps, and activity measurements from Fitbit are fairly accurate and can be used for research purposes [,]. The decision to choose measures related to HR instead of resting HR was based on the availability of data and the consideration of having enough measurements for each individual to train ML models. Digital biomarker data were filtered on days of compliant data, which were characterized by (1) at least 10 hours of Fitbit wear time within a day and (2) between 100 and 45,000 steps, as seen in previous studies []. Individuals from the PPD cohort were excluded from individualized ML models if they had <50 days of total data.
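
The compliance filter and the minimum-data rule can be expressed compactly. The Python sketch below is illustrative only; the field names (`wear_hours`, `steps`), the inclusive step bounds, and the choice to count days after the compliance filter are assumptions.

```python
MIN_WEAR_HOURS = 10
MIN_STEPS, MAX_STEPS = 100, 45_000
MIN_TOTAL_DAYS = 50


def is_compliant(day):
    """A day is compliant with >=10 hours of wear and 100-45,000 steps."""
    return (
        day["wear_hours"] >= MIN_WEAR_HOURS
        and MIN_STEPS <= day["steps"] <= MAX_STEPS
    )


def prepare_days(days):
    """Keep compliant days; return None if fewer than 50 days remain
    (assumption: the 50-day threshold is applied after filtering)."""
    kept = [d for d in days if is_compliant(d)]
    return kept if len(kept) >= MIN_TOTAL_DAYS else None


example_day = {"wear_hours": 12, "steps": 8000}
```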

Statistical Analysis

Assessing Variation in Digital Biomarkers Among Women

The lme4 and lmerTest packages in R were used to construct hierarchical linear regression models aiming to assess the presence of noteworthy differences among women and examine the relationship between each period and digital biomarkers [,]. To assess whether there was a significant level of variation in digital biomarkers among individuals, we processed data to calculate the average value of each digital biomarker during each period (eg, average HR during the prepregnancy period, average HR during pregnancy, average HR during the postpartum period, and average HR during PPD) and conducted linear mixed-effects models with person ID as the random effect. One model was built for each digital biomarker, where the digital biomarker served as the outcome variable, the period was considered the independent variable, and person ID was incorporated as a random effect. The presence of significant variability among individuals was evaluated using the performance package at a significance level of .05 [].

Interrupted Time-Series Analysis, Tukey Honest Significant Differences Test, and Digital Biomarker Directionality Assessment Between Periods

The interrupted time-series analysis (ITSA) was conducted using the its.analysis package in R with a significance level of .05 []. To compare whether there was a difference in digital biomarkers during different periods before, during, and after pregnancy, in addition to when patients experienced PPD, 4 periods were defined for each individual identified with PPD (prepregnancy period, pregnancy, postpartum period without depression [hereafter referred to as postpartum period], and postpartum period with depression [PPD]). The median duration of each period was 206 (IQR 154.50-313.50) days for the prepregnancy period, 258 (IQR 226-264) days for pregnancy, 42 (IQR 27.5-90) days for the postpartum period, and 42.5 (IQR 40.25-44.75) days for PPD. For each woman, a model was constructed for each digital biomarker, with 250 replications used for bootstrapping, which is a parameter of the itsa.model() function. Bootstrapping runs replications of the main model with randomly drawn samples and a trimmed median (10% removed); the F value is reported, and a bootstrapped P value is derived from it []. The dependent variable was the digital biomarker value, the time parameter was the date, and the interrupting variable was the period (prepregnancy period, pregnancy, postpartum period, and PPD). The mean and SD were calculated for each digital biomarker during each of the 4 periods for each woman. Furthermore, a Tukey honest significant difference (HSD) test was conducted to assess the statistical significance of the differences in each digital biomarker between each pair of periods (PPD–prepregnancy period, PPD-pregnancy, PPD–postpartum period, postpartum period–prepregnancy period, postpartum period–pregnancy, and pregnancy–prepregnancy period) within each individual at a significance level of .05 []. Next, the percentage of women exhibiting a significant relationship was calculated for each digital biomarker in each group comparison (eg, PPD–prepregnancy period).
To determine the overall trend in digital biomarker change between pairs of periods (eg, PPD and prepregnancy period, PPD and pregnancy, and PPD and postpartum period), the average difference across all individuals was computed for each digital biomarker. This average also included nonsignificant differences as they still contributed insights into the directionality of digital biomarkers during those periods even if the differences were not statistically significant. Finally, a 1-sample t test (2-tailed) at a significance level of .05 was conducted to assess the statistical significance of the net difference compared to 0, with positive change defined as an average value of >0 and negative change defined as an average value of <0. The outcomes were visualized in a heat map using the ggplot2 package in R. Percentages were represented as percentage ranges to preserve patient confidentiality, with the upper value of each range depicted in the heat maps (eg, 62% would fall within the 60%-65% range, and 65% would be displayed in the heat map).
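
The directionality assessment can be illustrated with a short Python sketch: enumerate the 6 period pairs, average each woman's between-period difference, and compute a t statistic for the net difference against 0. This is not the authors' code; the data layout and function names are hypothetical, and only the test statistic is computed (the P value would additionally require the t distribution).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

PERIODS = ["prepregnancy", "pregnancy", "postpartum", "PPD"]


def period_pairs():
    """All 6 unordered pairs of the 4 periods, later period listed first."""
    return [(later, earlier) for earlier, later in combinations(PERIODS, 2)]


def net_difference(per_woman_means, later, earlier):
    """Average (later minus earlier) difference across women and the
    corresponding t statistic for a comparison with 0.

    per_woman_means: one dict per woman mapping period -> mean biomarker value.
    Requires at least 2 women and nonidentical differences.
    """
    diffs = [m[later] - m[earlier] for m in per_woman_means]
    avg = mean(diffs)
    t_stat = avg / (stdev(diffs) / sqrt(len(diffs)))
    return avg, t_stat


# Hypothetical average HR values for 3 women.
women = [
    {"PPD": 72, "prepregnancy": 70},
    {"PPD": 75, "prepregnancy": 71},
    {"PPD": 71, "prepregnancy": 68},
]
avg_diff, t_stat = net_difference(women, "PPD", "prepregnancy")
```

A positive `avg_diff` corresponds to a positive change in the heat map, and a negative value corresponds to a negative change.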

Evaluating Health Care–Seeking Behavior

Health care–seeking behavior was assessed by looking at the number of visits recorded for each woman during the postpartum period (ie, ranging from the date of delivery to 30 days after the index date for each woman). The number of visits was determined by counting the number of rows in the visit occurrence table in the AoURP. We subsequently conducted an unpaired 2-sided Wilcoxon test with a significance level of .05 to determine whether the medians exhibited a significant difference between the PPD and non-PPD cohorts.

We also examined the proportion of women who adhered to the recommendation set by the American College of Obstetricians and Gynecologists, which advised women to attend at least one visit within the initial 6 weeks of the postpartum period. Of note, this guideline was updated in 2018 and now recommends a postpartum visit within the first 3 weeks following delivery []. However, we used the pre-2018 guideline in our analysis because the AoURP cohort includes individuals enrolled before 2018. The percentages of women who attended postpartum visits within the first 6 weeks in the PPD and non-PPD cohorts were compared using a 2-proportion z test at a significance level of .05. The exact percentage of women in the PPD cohort, in addition to the exact counts used to calculate the percentages, was obfuscated to maintain patient privacy.
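
For readers unfamiliar with the 2-proportion z test, a minimal Python sketch follows. It uses the pooled-variance form without a continuity correction, which is an assumption and may differ from the exact routine used in the study; the counts in the example are invented purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z test for the difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: women attending a visit within 6 weeks in each cohort.
z, p_value = two_proportion_z_test(50, 100, 30, 100)
```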

Comparing Self-Reported and Gold-Standard Weight Measurements

Weight measurements were queried in AoURP using the measurements table (Observational Medical Outcomes Partnership concept ID 3025315). Self-reported and gold-standard weight measurements were distinguished by referencing the src_id column, indicating a physical measurement (self-reported) as opposed to measurements obtained from an EHR site (gold standard). Subsequently, we identified the self-reported and gold-standard weight measurements with the shortest time interval for each woman. Only measurements taken within a period of <30 days were considered to ensure that the measurements were closely aligned and not too distant. The median and IQR of self-reported and gold-standard measurements were calculated and compared using a paired 2-sided Wilcoxon test at a significance level of .05. This process was repeated in the PPD and non-PPD cohorts.
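
The pairing step (closest self-reported and gold-standard measurements, kept only if <30 days apart) can be sketched as follows. This Python example is illustrative; the tuple layout `(date, weight)` and the function name are assumptions.

```python
from datetime import date

MAX_GAP_DAYS = 30


def closest_pair(self_reported, gold_standard):
    """Return (gap_days, self_weight, gold_weight) for the pair of
    measurements closest in time, or None if no pair is <30 days apart.

    Each input is a list of (date, weight) tuples for one woman.
    """
    best = None
    for s_date, s_weight in self_reported:
        for g_date, g_weight in gold_standard:
            gap = abs((s_date - g_date).days)
            if gap < MAX_GAP_DAYS and (best is None or gap < best[0]):
                best = (gap, s_weight, g_weight)
    return best


pair = closest_pair(
    [(date(2020, 1, 1), 70.0), (date(2020, 6, 1), 72.0)],
    [(date(2020, 1, 11), 70.5), (date(2020, 9, 1), 73.0)],
)
```

The resulting per-woman pairs are what a paired test would then compare.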

Comparing Weight Across Periods of Pregnancy

Weights across different periods of pregnancy (prepregnancy period, pregnancy, postpartum period, and PPD [or PPD equivalent for those without PPD]) were computed in the PPD and non-PPD cohorts using linear mixed-effects models in the lme4 package in R, with weight serving as the outcome variable, period as the independent variable, and person ID as the random effect. The results were evaluated at a significance level of .05. For women in the PPD cohort, the PPD period was used as the reference as it was the period of interest for understanding weight change. Similarly, the PPD-equivalent period was used as the reference for women in the non-PPD cohort. We further calculated the estimated means of weight across periods using the emmeans package in R for both the PPD and non-PPD cohorts.

Comparing Weight Retention in the PPD and Non-PPD Cohorts

To assess weight retention among women who experienced PPD compared to those without PPD, we first calculated the median weight of each woman during the prepregnancy period. Second, we identified the weight measurement during the postpartum period that was closest in value to the median prepregnancy weight on an individual basis. Third, the time difference in days was computed between the date of the weight measurement and the onset of pregnancy for each individual. Finally, we determined the median and IQR for the time difference in days mentioned in step 3 (ie, the difference in days between the onset of pregnancy and the date of the postpartum weight measurement that was closest in value to the median prepregnancy weight for each individual) and subsequently conducted an unpaired 2-sided Wilcoxon test to assess the difference in medians at a significance level of .05 between women in the PPD and non-PPD cohorts.
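
The first 3 steps of this procedure can be sketched for a single woman as follows. This is a hedged Python illustration, not the authors' code; the function name and input layout are hypothetical, and ties in closeness are resolved by taking the first measurement.

```python
from datetime import date
from statistics import median


def days_to_prepregnancy_weight(prepregnancy_weights, postpartum, pregnancy_onset):
    """Days from pregnancy onset to the postpartum measurement whose value
    is closest to the woman's median prepregnancy weight.

    postpartum: list of (date, weight) tuples from the postpartum period.
    """
    baseline = median(prepregnancy_weights)
    closest_date, _ = min(postpartum, key=lambda m: abs(m[1] - baseline))
    return (closest_date - pregnancy_onset).days


days = days_to_prepregnancy_weight(
    [68, 70, 72],
    [(date(2021, 1, 1), 78), (date(2021, 4, 1), 71), (date(2021, 8, 1), 74)],
    pregnancy_onset=date(2020, 1, 1),
)
```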

Building ML Models

Individualized ML Models for Women in the PPD Cohort

Individualized ML models were developed with the objective of determining the potential of digital biomarkers to differentiate among 4 distinct pregnancy phases: prepregnancy period, pregnancy, postpartum period without depression (ie, postpartum period), and postpartum period with depression (ie, PPD). Specifically, we sought to assess whether we could develop ML models for each woman to make a prediction to classify a day of Fitbit data as falling during the prepregnancy, pregnancy, postpartum, or PPD period based on behavioral and biometric data captured by digital biomarkers on Fitbit. In other words, the models tested whether there was a unique digital signature associated with each period of pregnancy in an individualized manner. Therefore, multinomial models were developed with period as the outcome with all 16 digital biomarkers as the features in the model (see Table S1 in for a list of the digital biomarkers included). Initially, our intention was to examine the model’s capacity to discriminate between periods with and without PPD, thereby constructing binomial classification models. However, we recognized the hierarchical nature of the data with repeated measurements (multiple days of data) during the prepregnancy, pregnancy, and postpartum time frames. Consequently, due to the repetitive nature of our data, we opted for constructing multinomial ML models to effectively discern among the 4 identified periods, where the PPD period was treated as both a period and a diagnosis. We were then able to focus on the PPD period by (1) constructing a confusion matrix to assess model performance for the PPD period at an individual level and (2) performing variable importance (see the following Variable Importance sections) for the PPD period.

To build intraindividual models, the data were filtered on each woman, where they were considered PPD negative ranging from 2 years before to 15 days before the index date and PPD positive from 14 days before to 30 days after the index date. We selected 14 days preceding the index date as the first day of being positive for PPD because the criteria for diagnosis state that patients must display 5 depressive symptoms lasting 2 weeks []. The time frame of 30 days following the index date was chosen because some individuals in the PPD cohort received antidepressant medication on the day of their diagnosis, which can begin to take effect after approximately 4 weeks of use []. For each individual, the data were centered and scaled before building models using 3 repeats of 10-fold cross-validation and a tune length of 5 with random forest (RF), generalized linear models (GLMs), support vector machine (SVM), and k-nearest neighbor (KNN) as these algorithms have been used in previous studies assessing depression using wearables [,]. Of note, no bootstrapping was performed as part of the individualized ML workflow. Models were built using the caret package in R and evaluated using a combination of the κ statistic and multiclass area under the receiver operating characteristic curve (mAUC), which are standard metrics for classification ML models [-]. Model performance for each period was further assessed using a confusion matrix, which calculated sensitivity, specificity, precision, recall, and F1-score [].
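
To make the per-period evaluation concrete, the Python sketch below derives one-vs-rest sensitivity, specificity, precision, and F1-score for a single class, together with the κ statistic, from a multiclass confusion matrix. It is illustrative only and is not the caret implementation; the matrix orientation (rows are true classes, columns are predictions) and the example counts are assumptions.

```python
def one_vs_rest_metrics(confusion, labels, target):
    """Metrics for one class, treating all other classes as negative.

    confusion[i][j] = number of days with true class i predicted as class j.
    """
    k = labels.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)  # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def cohen_kappa(confusion):
    """Agreement between predicted and true classes beyond chance."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / total**2
    return (observed - expected) / (1 - expected)


labels = ["prepregnancy", "pregnancy", "postpartum", "PPD"]
# Hypothetical day counts for one woman's model.
confusion = [
    [90, 10, 0, 0],
    [10, 85, 5, 0],
    [0, 5, 30, 5],
    [0, 0, 5, 35],
]
ppd_metrics = one_vs_rest_metrics(confusion, labels, "PPD")
kappa = cohen_kappa(confusion)
```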

Comparing Individualized ML Model Performance Between Women With a History of Depression Before or During Pregnancy

To initially ascertain the presence of depression history before or during pregnancy within the PPD cohort, we determined the date of delivery (using condition data) or the date of pregnancy (using condition and survey data) based on the concept IDs detailed in . Depending on the available data for each woman, the date of pregnancy was calculated by subtracting 9 months from the date of delivery, whereas the date of delivery was calculated by adding 9 months to the date of pregnancy, representing a standard pregnancy duration []. In cases in which both delivery and pregnancy records existed, priority was given to the date of delivery due to its heightened reliability.

For the evaluation of individualized ML model performance within the PPD cohort concerning women with a history of depression, the cohort was categorized into four subgroups encompassing (1) no previous depression history, (2) depression before pregnancy, (3) depression during pregnancy, and (4) depression both before and during pregnancy. To examine potential disparities in individualized ML model performance, a 2-sided unpaired t test was conducted with a significance threshold of .05. This analysis was executed to compare the no-depression-history group with the groups of women exhibiting depression before, during, or both before and during pregnancy. Sensitivity, specificity, precision, recall, and F1-score metrics were subjected to this statistical comparison process.

Individualized ML Models for Women Without PPD

To construct individualized ML models for women in the non-PPD cohort, we implemented an analogous approach to the one used for women in the PPD cohort, where an ML model was built for each woman with period as the multinomial outcome. It is worth noting that women without PPD would not have a fourth period (ie, postpartum period with depression in women with PPD) as they did not experience PPD. To ensure comparability and effectively gauge model performance between women with and without PPD, we created a PPD-equivalent period for the non-PPD cohort mirroring the PPD period. Considering that the median time to diagnose PPD was found to be 83 days following delivery, we ensured uniformity by setting the index date of the PPD-equivalent period at 83 days after delivery. As we established an index date aligned with that of the PPD cohort, the interval of 14 days before the index date was not considered as the PPD-equivalent period for these women because they did not actually experience PPD. The goal was to validate any observed alterations in the PPD cohort by investigating whether there were any changes in the digital signature between the postpartum and PPD-equivalent periods, which should not exist given that these women did not experience PPD. Subsequently, individualized ML models were constructed in a manner akin to those in the PPD cohort using the RF algorithm (as this algorithm yielded optimal results in the PPD cohort) using 3 repetitions of 10-fold cross-validation and a tuning length of 5. Similar to the approach developed for women in the PPD cohort, model performance was evaluated using sensitivity, specificity, precision, recall, and F1-score [,]. Models were not assessed using mAUC or κ as model performance only decreased in the PPD-equivalent period and not in the prepregnancy, pregnancy, or postpartum periods compared to those in the PPD cohort.
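
The construction of the PPD-equivalent period can be sketched as a day-labeling function. This Python example is a hedged illustration under stated assumptions, not the authors' implementation: the index date is placed 83 days after delivery, the PPD-equivalent period is assumed to run from the index date to 30 days after it (the 14 days before the index date are left in the postpartum period, as described above), and the exact boundary conventions are hypothetical.

```python
from datetime import date, timedelta

PPD_EQUIVALENT_OFFSET_DAYS = 83  # median days from delivery to PPD diagnosis
FOLLOW_UP_DAYS = 30


def label_day_non_ppd(day, pregnancy_start, delivery):
    """Assign one day of wearable data to a period for a woman without PPD.

    Returns None for days after the end of the PPD-equivalent period.
    """
    index_date = delivery + timedelta(days=PPD_EQUIVALENT_OFFSET_DAYS)
    if day < pregnancy_start:
        return "prepregnancy"
    if day < delivery:
        return "pregnancy"
    if day < index_date:
        return "postpartum"
    if day <= index_date + timedelta(days=FOLLOW_UP_DAYS):
        return "PPD-equivalent"
    return None


pregnancy_start, delivery = date(2020, 4, 1), date(2021, 1, 1)
index_date = delivery + timedelta(days=PPD_EQUIVALENT_OFFSET_DAYS)
label = label_day_non_ppd(index_date, pregnancy_start, delivery)
```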

Comparing Individualized ML Model Performance for Women in the PPD and Non-PPD Cohorts

For comparing the performance of individualized ML models in the PPD cohort to those in the non-PPD cohort, we performed a 2-sided unpaired t test with a significance level of .05.
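As a rough illustration of this comparison, the sketch below computes an unpaired t statistic for one metric across two groups of individualized models. It uses Welch's unequal-variance form, which is an assumption, because the text does not state whether variances were pooled. The metric values are invented for illustration. The 2-sided P value would then be read from a t distribution with the returned degrees of freedom, which is not computed here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for an unpaired two-sample comparison (Welch form)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical per-model sensitivities, not values from the study.
ppd_sensitivity = [0.82, 0.75, 0.90, 0.68, 0.80]
equivalent_sensitivity = [0.55, 0.40, 0.71, 0.30, 0.64]
t_stat, dof = welch_t(ppd_sensitivity, equivalent_sensitivity)
```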

Variable Importance

Shapley Additive Explanations Approach

We used the RF ML models to generate a ranking of digital biomarkers for each individual as these models had the best performance. Following that, Shapley values were computed for each measurement within each individualized model for the PPD class using the iml package in R []. To determine the feature ranking within individual models, we computed the average absolute Shapley values across all measurements for each digital biomarker and sorted the rankings from largest to smallest. We then tallied the number of models in which each biomarker ranked among the top 5 most predictive for the PPD class to produce an overall ranking of digital biomarkers. Furthermore, we determined the most predictive feature of PPD by totaling the number of models in which each digital biomarker ranked as the top predictor for the PPD class.
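The ranking and tallying steps can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study. It assumes the Shapley values for the PPD class have already been computed for each individualized model, and the biomarker names and values are hypothetical.

```python
from collections import Counter


def rank_biomarkers(shap_values):
    """Rank biomarkers by mean absolute Shapley value, largest first.

    shap_values maps a biomarker name to its Shapley values for the PPD
    class, one per measurement, within a single individualized model.
    """
    mean_abs = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in shap_values.items()
    }
    return sorted(mean_abs, key=mean_abs.get, reverse=True)


def tally_rankings(models, top_n=5):
    """Count how often each biomarker is in the top_n and how often it ranks first."""
    in_top, is_first = Counter(), Counter()
    for shap_values in models:
        ranking = rank_biomarkers(shap_values)
        in_top.update(ranking[:top_n])
        is_first[ranking[0]] += 1
    return in_top, is_first


# Toy Shapley values for two hypothetical women and three biomarkers.
models = [
    {"calories_bmr": [0.30, -0.25], "average_hr": [0.10, -0.05], "sum_steps": [0.02, 0.01]},
    {"calories_bmr": [0.05, -0.04], "average_hr": [0.20, -0.22], "sum_steps": [0.01, 0.03]},
]
in_top2, is_first = tally_rankings(models, top_n=2)
```

With 16 biomarkers per model, `top_n=5` would correspond to the top 5 tally described above.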

Permutation Approach

To enhance the robustness of our approach, variable importance was also computed using a permutation-based method in the caret package in R []. Subsequently, the features were sorted by the magnitude of their variable importance for the PPD class. Using a methodology similar to that used for Shapley additive explanations (SHAP), we tabulated the number of models in which each digital biomarker ranked among the top 5 most predictive for the PPD class, yielding an overall ranking of digital biomarkers. The number of models in which each feature ranked as the top predictor for the PPD class was also recorded.
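The general idea behind permutation importance can be sketched as follows. This is not the caret implementation. It is a generic version that shuffles one feature at a time and records the resulting drop in accuracy, and the classifier `toy_predict` and the data are hypothetical stand-ins.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with feature j replaced by its shuffled value
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return drops


# Hypothetical stand-in for a fitted classifier: it looks only at feature 0.
def toy_predict(row):
    return "PPD" if row[0] > 1400 else "postpartum"


# Each row is (calories BMR, average HR); values are invented.
rows = [(1500, 70), (1450, 82), (1350, 75), (1300, 68)] * 5
labels = [toy_predict(r) for r in rows]
importance = permutation_importance(toy_predict, rows, labels, n_features=2)
```

A feature the model never uses, such as average HR in this toy case, receives an importance of zero because shuffling it leaves every prediction unchanged.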

SHAP Dependence Plots

SHAP dependence plots were generated using the ggplot2 package in R []. For each individual, plots were generated by graphing the Shapley value against the corresponding actual value for the digital biomarker. Given that the outcome of the models was multinomial (prepregnancy period, pregnancy, postpartum period, or PPD), 3 separate SHAP dependence plots were generated for each individual using calories BMR data during PPD with one other period (ie, one plot for the prepregnancy and PPD periods [referred to as prepregnancy vs PPD], one plot for pregnancy and PPD [referred to as pregnancy vs PPD], and one plot for the postpartum and PPD periods [referred to as postpartum vs PPD]) to more easily analyze the relationship between calories BMR in a binomial context between PPD and one other period. This process was repeated for women in the non-PPD cohort in a similar fashion to those in the PPD cohort, specifically, PPD-equivalent versus prepregnancy period (prepregnancy vs PPD-equivalent), PPD-equivalent versus pregnancy (pregnancy vs PPD-equivalent), and PPD-equivalent versus postpartum (postpartum vs PPD-equivalent). The Pearson correlation coefficient and its corresponding P value were computed at a significance level of .05, followed by calculating the percentages of women with and without a significant correlation. If a significant correlation was observed, we further determined its direction (positive or negative) and calculated the percentages of women with a positive or negative correlation. The overall consensus regarding the relationship was determined by comparing the percentages of positive and negative correlations for each digital biomarker across all individuals, thereby identifying which direction was more frequent. In cases in which the proportion of women with a significant correlation was <40%, the direction was not assessed due to the small sample size, which may not be representative of the population.
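The correlation and consensus steps can be sketched as follows. This is a minimal illustration. It assumes the P value for each woman's correlation has already been obtained elsewhere, the function names are hypothetical, and reporting a tie separately is an added convention that the text does not describe.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between biomarker values x and Shapley values y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def consensus_direction(results, alpha=0.05, min_share=0.40):
    """Summarize per-woman (r, p) pairs for one biomarker and one pair of periods."""
    significant = [r for r, p in results if p < alpha]
    share = len(significant) / len(results)
    if share < min_share:
        return share, "not assessed"  # too few significant correlations
    positive = sum(r > 0 for r in significant)
    negative = len(significant) - positive
    if positive == negative:
        return share, "tie"
    return share, "positive" if positive > negative else "negative"


# Invented values: one woman's calories BMR against her Shapley values.
r = pearson_r([1400, 1450, 1500], [-0.2, 0.0, 0.2])

# Invented (r, p) pairs for four women.
summary = consensus_direction([(0.6, 0.01), (0.5, 0.03), (-0.4, 0.02), (0.1, 0.60)])
```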

Building an ML Model for PPD Using a Cohort-Based Approach

To construct an ML model that assessed whether a woman had PPD, we used data from the PPD period in the PPD cohort and the PPD-equivalent period in the non-PPD cohort. We developed a binomial RF classification model in which 75% of individuals from each cohort were designated for the training set and the remaining 25% were assigned to the test set using the caret package in R []. To ensure a reliable assessment of model performance, we split the training and test sets by person ID, thereby preventing any overlap of women between the 2 sets that could distort the results []. The model's outcome was a binary classification of whether an individual exhibited PPD, with all 16 digital biomarkers as input (refer to Table S1 for a comprehensive description of the digital biomarkers used). The data were normalized through centering and scaling. Notably, repeated cross-validation was omitted due to the presence of repeated measurements stemming from various person IDs. The model was built with a tune length of 5. The model was evaluated using κ and the area under the receiver operating characteristic curve (not multiclass in this instance as the outcome was binary). Subsequently, a confusion matrix was generated to calculate sensitivity, specificity, precision, recall, and F1-score [-].
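The person-level split can be sketched as follows. This is a minimal illustration in Python rather than the caret workflow used in the study. The person IDs are hypothetical, and computing the centering and scaling parameters from the training set only is an assumption.

```python
import random
from statistics import mean, stdev


def split_person_ids(person_ids, train_fraction=0.75, seed=0):
    """Assign whole individuals, not individual days, to the training or test set."""
    ids = sorted(set(person_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return set(ids[:n_train]), set(ids[n_train:])


def center_scale(train_values, values):
    """Standardize values using the mean and SD of the training set."""
    mu, sd = mean(train_values), stdev(train_values)
    return [(v - mu) / sd for v in values]


# Hypothetical person IDs for each cohort.
ppd_ids = ["p01", "p02", "p03", "p04"]
non_ppd_ids = ["n01", "n02", "n03", "n04", "n05", "n06", "n07", "n08"]

# Split each cohort separately so that both sets keep the 75/25 ratio per cohort.
train_ids, test_ids = set(), set()
for cohort in (ppd_ids, non_ppd_ids):
    train_part, test_part = split_person_ids(cohort)
    train_ids |= train_part
    test_ids |= test_part
```

Every day of data then follows its person ID into one set, so no woman contributes measurements to both the training and test sets.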


Results

Descriptive Statistics

Through computational phenotyping in the AoURP, we identified a cohort of women who gave birth with PPD (n<20) and without PPD (n=39) and who provided valid Fitbit data. The median age in the PPD cohort was 35.60 (IQR 32.83-37.36) years compared to 33.60 (IQR 30.72-35.56) years in the non-PPD cohort. The median and IQR were calculated for each digital biomarker across all women in the PPD and non-PPD cohorts (Table 1). In both the PPD and non-PPD cohorts, we computed the median number of days with digital biomarker data during the prepregnancy, pregnancy, postpartum, and PPD (or PPD-equivalent) periods and the corresponding IQRs (Table 1; additional details about the PPD-equivalent period, a similar fourth period for those without PPD, can be found in the Methods section). Briefly, the digital biomarkers included in this analysis were daily average HR, HR SD, minimum HR, quartile 1 HR, median HR, quartile 3 HR, maximum HR, sum of steps, activity calories, calories BMR, calories out, fairly active minutes, lightly active minutes, marginal calories, sedentary minutes, and very active minutes (see the descriptions in Table S1).

Figure 2. A schematic of postpartum depression (PPD) computational phenotyping.

Table 1. Descriptive statistics of the postpartum depression (PPD) and non-PPD patient cohorts in the All of Us Research Program.

| Descriptive statistics | PPD (n<20), median (IQR) | Non-PPD (n=39), median (IQR) |
|---|---|---|
| Age (y) | 35.60 (32.83-37.36) | 33.60 (30.72-35.56) |
| Digital biomarker | | |
| Average HR^a (bpm) | 74.23 (68.36-80.66) | 78.31 (72.32-83.97) |
| HR SD (bpm) | 12.18 (10.58-14.05) | 12.70 (10.72-15.12) |
| Minimum HR (bpm) | 54.00 (49.00-60.00) | 57.00 (52.00-61.00) |
| Quartile 1 HR (bpm) | 64.00 (59.00-71.00) | 68.00 (62.00-74.00) |
| Median HR (bpm) | 72.00 (66.00-78.00) | 76.00 (70.00-82.00) |
| Quartile 3 HR (bpm) | 81.00 (74.00-88.00) | 85.00 (78.00-92.00) |
| Maximum HR (bpm) | 124.00 (117.00-135.00) | 127.00 (119.00-141.00) |
| Sum steps | 7567.50 (4884.00-10536.25) | 7352.00 (4838.00-10834.00) |
| Activity calories | 989.00 (742.75-1263.00) | 964.00 (684.00-1275.00) |
| Calories burned during BMR^b | 1466.00 (1379.00-1539.00) | 1390.00 (1340.00-1496.00) |
| Calories out | 2236.00 (2012.00-2483.25) | 2180.00 (1925.00-2465.50) |
| Fairly active minutes | 9.00 (0.00-24.00) | 8.00 (0.00-23.00) |
| Lightly active minutes | 245.00 (189.00-315.00) | 245.00 (187.50-310.00) |
| Marginal calories | 501.00 (349.00-665.00) | 489.00 (322.00-680.00) |
| Sedentary minutes | 646.00 (563.00-741.00) | 710.00 (607.00-880.50) |
| Very active minutes | 2.00 (0.00-18.00) | 4.00 (0.00-21.00) |
| Number of days in each period | | |
| Prepregnancy period | 206.00 (154.50-313.50) | 227.00 (109.50-340.75) |
| Pregnancy | 258.00 (226.00-264.00) | 221.00 (129.00-269.50) |
| Postpartum period | 42.00 (27.50-90.00) | 72.00 (46.00-82.00) |
| PPD | 42.50 (40.25-44.75) | 29.00 (14.50-31.00) |

^a HR: heart rate.

^b BMR: basal metabolic rate.

Digital Biomarker Comparison Across Periods of Pregnancy Revealed Altered Profiles and Heterogeneity Among Women

Because of the known heterogeneity in depressive symptoms, we hypothesized that variability in digital biomarkers may exist across individuals in the PPD cohort []. To test this hypothesis, we fitted linear mixed-effects models for each digital biomarker in women with PPD and found that the random effect of person ID was significant (P<.001) for all digital biomarkers, suggesting meaningful variability across individuals (Table S2). These results, coupled with the small cohort sample size, prompted us to perform subsequent analyses using an intraindividual approach.

In women with PPD, we next sought to determine whether digital biomarkers differed across periods of pregnancy: prepregnancy period, pregnancy, postpartum period, and PPD (where PPD represents both a period and a diagnosis). Therefore, an intraindividual ITSA and Tukey HSD test were conducted for each digital biomarker. Because of the physiological changes associated with pregnancy, such as increases in blood and stroke volume, in addition to the behavioral fluctuations that occur during PPD, such as a loss of energy and psychomotor retardation, we hypothesized that all digital biomarkers (those related to HR, steps, physical activity, and calories burned) would be altered across the prepregnancy, pregnancy, postpartum, and PPD periods [,-]. ITSA results supported our hypothesis and demonstrated a significant difference in all digital biomarkers across periods in most women with PPD (Table S3). Consistent with ITSA findings, Tukey HSD results showed that several digital biomarkers were significantly altered between PPD and other periods (prepregnancy, pregnancy, and postpartum periods; Figure 3). We further observed various trends in digital biomarkers between pairs of periods (ie, PPD and prepregnancy period, PPD and pregnancy, and PPD and postpartum period; Table S4).

Figure 3. Digital biomarkers vary across different periods of pregnancy among women with postpartum depression (PPD). The percentage of women in the PPD cohort exhibiting a significant difference in digital biomarker values between each pair of periods (left [represented by 0-100]) and the direction of their relationship (right). The x-axis illustrates a comparison of Tukey honest significant differences (HSD) between 2 periods of interest, representing the subtraction of digital biomarker values between the first and second periods. Tukey HSD tests were individually conducted for each woman’s data, and the percentage showing a significant relationship was calculated and presented on the heat map. The heat map on the right illustrates the overall relationship between the digital biomarker during the 2 periods of interest among the women who exhibited a significant relationship (as indicated by the percentage shown on the left heat map), with the period listed second serving as the reference. In summary, the findings indicated that digital biomarkers undergo significant alterations across different periods of pregnancy on an individual basis. Calories BMR: calories burned during the basal metabolic rate; HR: heart rate; NS: not significant; Q1: quartile 1; Q3: quartile 3.

Individualized ML Models Effectively Differentiated PPD From Alternative Periods of Pregnancy

Having seen that digital biomarkers were significantly altered across multiple periods of pregnancy in women with PPD, we surmised that individualized multinomial ML models could accurately distinguish between our 4 periods of pregnancy (prepregnancy period, pregnancy, postpartum period, or PPD; Tables S3 and S4). Therefore, we sought to assess whether ML models for each woman could accurately classify an unknown day of Fitbit data as falling during the prepregnancy, pregnancy, postpartum, or PPD period based on behavioral and biometric data captured by digital biomarkers on Fitbit. In essence, the models examined whether a distinct digital signature was linked to each pregnancy period in an individualized fashion. To test this hypothesis, intraindividual ML models were generated using RF, GLM, SVM, and KNN to determine which algorithm performed best. Models were assessed using a combination of the mAUC and κ, which are 2 frequently used metrics [,]. After averaging the mAUC for individual models within each algorithm, the results revealed that RF models performed the best, followed by GLM, SVM, and then KNN, with an average mAUC of 0.85, 0.82, 0.75, and 0.74, respectively (Table 2). Assessing models in a similar fashion using another metric, κ, yielded concordant results for RF (0.80), GLM (0.74), SVM (0.72), and KNN (0.62) model performance, suggesting that the RF algorithm had the best performance and should be used going forward (Table 2).

Table 2. Individualized random forest (RF) models exhibited the best performance for multinomial period classification.

| Algorithm | mAUC^a, mean (SD) | κ, mean (SD) |
|---|---|---|
| Random forest | 0.85 (0.09) | 0.80 (0.15) |
| Generalized linear model | 0.82 (0.09) | 0.74 (0.16) |
| Support vector machine | 0.75 (0.10) | 0.72 (0.16) |
| k-nearest neighbor | 0.74 (0.10) | 0.62 (0.19) |

^a mAUC: multiclass area under the receiver operating characteristic curve.
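For readers less familiar with κ, the sketch below shows how Cohen's κ can be computed from a confusion matrix for a single model. It is a generic illustration with invented counts, not the caret implementation used in the study, and it does not cover the mAUC.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, given the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Invented counts for one woman's 4-period model.
# Class order: prepregnancy, pregnancy, postpartum, PPD.
toy_matrix = [
    [50, 3, 1, 1],
    [4, 60, 2, 1],
    [1, 2, 20, 4],
    [0, 1, 3, 16],
]
kappa = cohen_kappa(toy_matrix)
```

Unlike raw accuracy, κ discounts the agreement expected by chance, which matters here because the 4 periods contribute very different numbers of days.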

As our analysis aimed to assess the potential of digital biomarkers for personalized classification of PPD, we sought to further examine each RF model’s performance via a confusion matrix. Thus, the average sensitivity, specificity, precision, recall, and F1-score were calculated across all individual models, where the results for the PPD class were 0.79, 0.95, 0.84, 0.79, and 0.81, respectively (Figure S1). The same metrics for the prepregnancy, pregnancy, and postpartum periods were also calculated (Figure S1).
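The per-class metrics can be derived from a multiclass confusion matrix by treating one class as positive and all others as negative. The sketch below illustrates this one-vs-rest calculation with invented counts and is not the study's code. Because sensitivity and recall share the same definition, their values are always identical.

```python
def one_vs_rest_metrics(matrix, k):
    """Metrics for class k from a confusion matrix (rows: true, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                # class k days predicted as another period
    fp = sum(row[k] for row in matrix) - tp  # other periods predicted as class k
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Invented counts. Class order: prepregnancy, pregnancy, postpartum, PPD.
toy_matrix = [
    [50, 3, 1, 1],
    [4, 60, 2, 1],
    [1, 2, 20, 4],
    [0, 1, 3, 16],
]
ppd_metrics = one_vs_rest_metrics(toy_matrix, k=3)
```

Averaging each metric over all individualized models then gives cohort-level summaries of the kind reported above.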

To ensure the widespread applicability of these algorithms to a diverse range of women, we did not exclude individuals with a history of depression either before or during pregnancy. Therefore, we sought to determine whether having depression before or during pregnancy impacted individual model performance, specifically for recognizing the PPD class. To answer this question, we computed the average sensitivity, specificity, precision, recall, and F1-score within the group of women experiencing PPD categorized based on their depression history: (1) no previous history of depression, (2) history before pregnancy, (3) history during pregnancy, or (4) history both before and during pregnancy. Notably, the findings revealed no statistically significant differences in any of these metrics between women with a history of depression during the prepregnancy or pregnancy periods and those without such a history (Figure S2). This suggests that a future technology for detecting PPD through digital biomarkers could be relevant for women with or without a history of depression before or during pregnancy.

Individualized ML Models for PPD Recognition Were Specific

To validate our approach of using digital biomarkers in individualized ML models for PPD detection, we aimed to test our strategy in a cohort of women who had given birth but did not experience PPD. We chose women without PPD as a control group for validation because they experienced the same 3 phases of pregnancy (prepregnancy period, pregnancy, and postpartum period) as women in the PPD cohort with the exception of PPD. Given that women without PPD did not have a distinct PPD-specific period as observed in the PPD cohort, we introduced a fourth time segment in the non-PPD cohort (the PPD-equivalent period). Following the same ML pipeline as for the PPD cohort, individualized RF models were built for women in the non-PPD cohort. If our conjecture held, we anticipated observing elevated model metrics during the prepregnancy and pregnancy periods followed by diminished performance in the postpartum and PPD-equivalent time segments. This expectation arose from the idea that digital biomarkers remain unaltered during the postpartum and PPD-equivalent periods, resulting in the model’s inability to differentiate between them.

In line with our hypothesis, the sensitivity, specificity, precision, recall, and F1-scores showed that ML models effectively identified the prepregnancy (0.89, 0.91, 0.88, 0.89, and 0.88, respectively) and pregnancy (0.85, 0.91, 0.87, 0.85, and 0.86, respectively) time intervals through digital biomarkers (Table 3). When compared to model performance in the prepregnancy and pregnancy periods, there was no significant reduction in model performance during the postpartum period (0.74, 0.96, 0.76, 0.74, and 0.75, respectively); however, a noticeable decline in performance was observed during the PPD-equivalent period (0.52, 0.99, 0.69, 0.52, and 0.61, respectively; Table 3). To further assess potential differences in classification performance between the PPD and PPD-equivalent periods, we carried out a t test comparing the average sensitivity, specificity, precision, recall, and F1-score between the PPD and non-PPD cohorts for these periods. The findings indicated a statistically significant decrease in sensitivity, precision, recall, and F1-score when predicting the PPD-equivalent period in the non-PPD cohort as opposed to predicting the PPD period in the PPD cohort. On the other hand, specificity remained largely unchanged. The decrease in performance among individualized ML models in the PPD-equivalent period implies that the models were unable to accurately classify the PPD-equivalent period, which was expected as there was no actual distinction between the postpartum and PPD-equivalent periods for these women. Collectively, these outcomes helped demonstrate the specificity of our approach in identifying PPD, supporting the argument that personalized models using digital biomarkers can effectively recognize PPD.

Table 3. Machine learning (ML) models did not accurately detect the postpartum depression (PPD)–equivalent period in women without PPD.

| Time period and metric | Value, mean (SD) |
|---|---|
| Prepregnancy period | |
| Sensitivity | 0.89 (0.15) |
| Specificity | 0.91 (0.10) |
| Precision | 0.88 (0.09) |
| Recall | 0.89 (0.15) |
| F1-score | 0.88 (0.13) |
| Pregnancy period | |
| Sensitivity | 0.85 (0.12) |
| Specificity | 0.91 (0.06) |
| Precision | 0.87 (0.07) |
| Recall | 0.85 (0.12) |
| F1-score | 0.86 (0.09) |
| Postpartum period | |
| Sensitivity | 0.74 (0.20) |
| Specificity | 0.96 (0.04) |
| Precision | 0.76 (0.16) |
| Recall | 0.74 (0.20) |
| F1-score | 0.75 (0.18) |
| PPD-equivalent period | |
| Sensitivity | 0.52 (0.33) |
| Specificity | 0.99 (0.03) |
| Precision | 0.69 (0.28) |
| Recall | 0.52 (0.33) |
| F1-score | 0.61 (0.30) |
