Data-driven identification of post-acute SARS-CoV-2 infection subphenotypes

EHR data repositories

Two large-scale, de-identified, real-world EHR data warehouses were used in our analyses. Our first cohort data were based on the EHR from the INSIGHT CRN14, which contains the longitudinal clinical information of approximately 12 million patients in the NYC area. Our second cohort data were based on the EHR from the OneFlorida+ CRN15, which contains the information of nearly 15 million patients mainly from Florida and selected cities in Georgia and Alabama.

This study is part of the National Institutes of Health (NIH) Researching COVID to Enhance Recovery (RECOVER) Initiative31, which seeks to understand, treat and prevent the post-acute sequelae of SARS-CoV-2 infection (PASC).

Potential PASC conditions

We systematically reviewed the literature on studies related to PASC or Long-COVID and compiled a comprehensive list of potential PASC conditions in ICD-10 codes based on the one provided in Al-Aly et al.10.

Then, our clinician team carefully reviewed the list and removed the codes that would not plausibly be considered as PASC in the adult population (for example, HIV, tuberculosis, infection by non-COVID causes, neoplasms, injury due to external causes, etc.) and systematically added the parent codes (for example, the first three digits of ICD-10 codes). Finally, these ICD-10 codes were mapped to 137 condition categories according to CCSR version 2022.1, following Al-Aly et al.10. The detailed code list is provided in Supplementary Table 2.

Cohort construction

For both cohorts, adult patients (age ≥20 years) with at least one SARS-CoV-2 polymerase chain reaction (PCR) or antigen laboratory test (Supplementary Table 11) between 1 March 2020 and 30 November 2021 were included. We further required these patients to have at least one diagnosis code in the baseline period and at least one diagnosis 30–180 days after the index date to confirm capture of healthcare utilization. Then, we selected the patients who had at least one positive test and at least one potential PASC condition in the follow-up (or post-acute infection) period. We further ensured that these potential PASC conditions were new incidences in the follow-up period by excluding patients who had any of them in both the baseline and follow-up periods. The overall inclusion–exclusion cascade is shown in Extended Data Fig. 9, and the relevant definitions are provided below.

Index date: the date of the first COVID-19-positive test.

Baseline period: from 3 years to 1 week before the index date.

Follow-up (post-acute infection) period: from 31 days after the index date to the day of documented death, the last record in the database, 180 days after baseline (we followed the definition provided by the NIH and the Centers for Disease Control and Prevention (CDC)10,22,31,32) or the end of our observational window (30 November 2021), whichever came first.

Statistical analysis

Topic modeling (TM)

We use binary vectors \(\{ {\mathbf{x}}_n \}_{n=1}^N\) to represent the patients, where n is the patient index and the i-th element of xn, denoted xn(i), equals 1 if the i-th potential PASC condition appears in the post-acute infection period of the n-th patient’s EHR and 0 otherwise. Therefore, each xn is a 137-dimensional binary vector (step 1 in Fig. 1). TM19 was then applied on these vectors to learn a set of potential PASC topics. A PASC topic in this context refers to a group of conditions that present together based on their incidence probabilities. Specifically, assume that each patient can be represented as a mixture of K latent PASC topics \({\mathbf{\Phi}} \in {\mathbb{R}}^{D \times K}\), where D = 137 is the number of unique PASC conditions and K is the total number of topics. Each topic φk (\(k:1 \le k \le K\) is the index of the topic) can be viewed as a set of PASC conditions that are more likely to be co-incident in the post-acute infection period of a particular patient (step 2 in Fig. 1). Then, for each patient, TM infers the mixture memberships \({\boldsymbol{\theta}}_n \in {\mathbb{R}}^K\), also called topic proportions or topic loadings, as the new representation of that patient (step 3 in Fig. 1). A higher loading value on a particular topic indicates that the patient suffers from more of the co-incident condition patterns in that topic. In other words, TM transforms the representation of each patient from the original 137-dimensional binary space xn to the low-dimensional continuous PASC topic space θn, which is leveraged further for subphenotype identification through clustering (step 4 in Fig. 1). Specifically, we used Poisson factor analysis (PFA)17 as the concrete TM method, which generates xn as follows.

Draw a topic proportion θn for the n-th patient from a Gamma distribution θn ~ Gamma(1,1);

Draw the k-th PASC topic from a Dirichlet distribution φk ~ Dirichlet(0.01), \(k = 1, \cdots ,K\);

Draw a binary vector xn from the Bernoulli distribution via the Bernoulli–Poisson link33,34,35,36:

$${\mathbf{x}}_n = {\mathbf{1}}\left( {\mathbf{u}}_n \ge 1 \right),\quad {\mathbf{u}}_n \sim \mathrm{Poisson}\left( {\mathbf{\Phi}} {\boldsymbol{\theta}}_n \right);$$

where \({\mathbf{1}}\left( \cdot \right)\) is an elementwise indicator function: xn = 1 if un ≥ 1 and xn = 0 if un = 0. The input of the PFA model is the observed patient binary vectors \(\{ {\mathbf{x}}_n \}_{n=1}^N\). The learnable parameters of the PFA model are the topic proportion vectors \(\{ {\boldsymbol{\theta}}_n \}_{n=1}^N\) and the topic matrix Φ. Because PFA is a fully conjugate probabilistic model, we trained it by Gibbs sampling with the conditional posteriors of the learnable parameters, whose specific equations can be found in Zhou et al.17. We used the Python package Pydpm version 3.0.1 (https://pypi.org/project/pydpm/) to train the model and released the code.
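To make the generative process above concrete, the following is a minimal NumPy sketch of the forward model only; the actual inference was done by Gibbs sampling via Pydpm, as noted above, and the patient count N and random seed here are hypothetical.

```python
# Minimal sketch of the PFA generative process (forward sampling only).
# D = 137 PASC conditions; K = 10 topics as eventually chosen in the paper;
# N is hypothetical.
import numpy as np

np.random.seed(0)
N, D, K = 1000, 137, 10

# Each topic phi_k ~ Dirichlet(0.01) is a distribution over the D conditions.
Phi = np.random.dirichlet(alpha=np.full(D, 0.01), size=K).T   # topic matrix, (D, K)

# Topic proportions theta_n ~ Gamma(1, 1) for every patient.
Theta = np.random.gamma(shape=1.0, scale=1.0, size=(K, N))    # (K, N)

# Bernoulli-Poisson link: u_n ~ Poisson(Phi @ theta_n), x_n = 1(u_n >= 1).
U = np.random.poisson(Phi @ Theta)                            # latent counts, (D, N)
X = (U >= 1).astype(np.int8)                                  # binary PASC vectors
```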

For PFA, we imposed a Dirichlet prior distribution on the k-th PASC topic so that it lies in a probability simplex (elements in φk represent the probability of each PASC condition appearing in the k-th topic). Compared to another popular TM method, latent Dirichlet allocation (LDA)29, which imposes a Dirichlet distribution on the topic proportions θn, PFA uses a Gamma distribution, which can yield more separable compressed representations17,35 and is thus useful for identifying more meaningful subphenotypes.

The number of topics, K, is an important parameter in TM. To determine an optimal K based on the data, we used two metrics: data likelihood and topic coherence37. Data likelihood evaluates the fitness of the model on the current dataset, where a larger value indicates better fitness. Topic coherence evaluates the relevance of the learned PASC topics to the investigative condition list, where the value ranges from 0 to 1 and a higher value indicates better coherence. Detailed calculations of these two metrics are provided below.

Data likelihood

In PFA, to model the binary PASC vector, we used the Bernoulli–Poisson link as:

$${\mathbf{x}}_n = {\mathbf{1}}\left( {\mathbf{u}}_n \ge 1 \right),\quad {\mathbf{u}}_n \sim \mathrm{Poisson}\left( {\mathbf{\Phi}} {\boldsymbol{\theta}}_n \right);$$

where xn is the binary PASC vector, Φ is the topic matrix, θn is the topic proportion vector and un is the latent variable that links the binary observation and the topic representation. According to the property of the Bernoulli–Poisson link, un can be marginalized out, yielding a Bernoulli likelihood:

$${\mathbf{x}}_n \sim \mathrm{Ber}\left( 1 - e^{- {\mathbf{\Phi}} {\boldsymbol{\theta}}_n} \right).$$

Given \(\{ {\mathbf{x}}_n \}_{n=1}^N\), after learning Φ and \(\{ {\boldsymbol{\theta}}_n \}_{n=1}^N\), we can calculate the Bernoulli data likelihood. In Extended Data Fig. 2, we show the mean (averaged over the N patients) data log-likelihood:

$$\frac{1}{N}\mathop{\sum}\limits_{n = 1}^{N} \mathop{\sum}\limits_{v = 1}^{V} \left[ x_{nv} \log \left( {\hat{x}}_{nv} \right) + \left( 1 - x_{nv} \right) \log \left( 1 - {\hat{x}}_{nv} \right) \right],$$

where N is the number of patients, V is the total number of PASC conditions and \({\hat{x}}_{nv}\) is the averaged reconstruction term. Because each iteration of Gibbs sampling provides only a point estimation, to obtain the likelihood of the distribution estimation in practice we collected the parameters of every iteration after the burn-in steps (500 burn-in steps in our experiment) and then took their average (Zhou et al.17), defined as \({\hat{\mathbf{x}}}_n = \frac{1}{Q}\mathop{\sum}\nolimits_{q = 1}^{Q} \left( 1 - e^{- {\mathbf{\Phi}}^{(q)} {\boldsymbol{\theta}}_n^{(q)}} \right)\), where Q is the total number of collected parameter samples (Q = 500 in our experiment).
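As an illustration, the mean log-likelihood above could be computed as in the following sketch, given the per-iteration point estimates of Φ and θn collected after burn-in; the function and variable names are ours, not part of the released code.

```python
# Sketch: mean Bernoulli data log-likelihood under the Bernoulli-Poisson link.
import numpy as np

def mean_log_likelihood(X, Phi_samples, Theta_samples, eps=1e-10):
    """X: (D, N) binary PASC matrix; Phi_samples, Theta_samples: lists of
    Q per-iteration Gibbs estimates collected after burn-in."""
    # Averaged reconstruction: x_hat = (1/Q) * sum_q (1 - exp(-Phi^(q) @ Theta^(q)))
    X_hat = np.mean(
        [1.0 - np.exp(-P @ T) for P, T in zip(Phi_samples, Theta_samples)],
        axis=0,
    )
    ll = X * np.log(X_hat + eps) + (1 - X) * np.log(1 - X_hat + eps)
    return ll.sum() / X.shape[1]  # averaged over the N patients
```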

Topic coherence

Topic coherence is an important metric to evaluate the quality of topics based on the input data. It is measured with a sliding window (in our case, the length of the window is the total number of PASC conditions), from which the normalized pointwise mutual information (NPMI) between the input data and the learned topics is calculated. We used the Python package GENSIM (https://radimrehurek.com/gensim/) to calculate the topic coherence.
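The snippet below sketches how such an NPMI coherence could be computed with GENSIM’s CoherenceModel; the exact preprocessing used in the study is not shown here, so the input format (per-patient lists of condition names and top conditions per topic) is an assumption.

```python
# Sketch: NPMI topic coherence via GENSIM (input format assumed).
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def npmi_coherence(topic_words, texts, window_size=137):
    """topic_words: list of lists of condition names (top conditions per topic);
    texts: per-patient lists of PASC conditions observed in the follow-up period."""
    dictionary = Dictionary(texts)
    cm = CoherenceModel(
        topics=topic_words,
        texts=texts,
        dictionary=dictionary,
        coherence="c_npmi",       # normalized pointwise mutual information
        window_size=window_size,  # window covering all 137 PASC conditions
    )
    return cm.get_coherence()
```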

The results are shown in Extended Data Fig. 2. From this figure, we can see that more topics provide a higher data likelihood because more topics are available to represent the original PASC conditions. However, more topics may bring down topic coherence, which suggests redundancy between the newly added topics and the old ones. With these considerations, we set the final number of topics to 10 for both the INSIGHT and OneFlorida+ cohorts, as this achieved the best topic coherence and a reasonable data likelihood (we do not want the data likelihood to be too high, as that may suggest overfitting).

As TM is a probabilistic process, we want to guarantee the robustness of the identified topics. To achieve this goal, we first performed 1,000 bootstrap runs (randomly sampling the same number of patients with replacement) to learn topics. Then, according to the importance of each topic (mean topic loading over all patients), we reordered the topics and calculated the cosine similarity among all topics from each pair of bootstrap runs:

$$S_{pq} = \frac{1}{1000 \times 999} \mathop{\sum}\limits_{i = 1}^{1000} \mathop{\sum}\limits_{j = 1, j \ne i}^{1000} \cos \left( {\boldsymbol{\varphi}}_p^{(i)}, {\boldsymbol{\varphi}}_q^{(j)} \right),$$

where \({\boldsymbol{\varphi}}_p^{(i)}\) is the p-th topic vector learned from the i-th bootstrapped sample set, and Spq is the similarity between the p-th and q-th topics. Extended Data Fig. 3a,b shows the heat map of the similarity matrix with Spq as its (p,q)-th entry, from which we can clearly observe a darker diagonal line. This indicates a high similarity between the topics learned from different bootstrapped samples and, thus, implies that the learned topics are robust.
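A compact sketch of this similarity computation, assuming the per-bootstrap topic matrices are stored column-wise and already reordered by topic importance:

```python
# Sketch: average cosine similarity between topics across bootstrap pairs.
import numpy as np

def bootstrap_topic_similarity(Phis):
    """Phis: list of I topic matrices of shape (D, K), topics reordered
    by importance. Returns the (K, K) similarity matrix S."""
    # Normalize columns so that A.T @ B yields pairwise cosine similarities.
    normed = [P / np.linalg.norm(P, axis=0, keepdims=True) for P in Phis]
    I, K = len(Phis), Phis[0].shape[1]
    S = np.zeros((K, K))
    for i in range(I):
        for j in range(I):
            if i != j:
                S += normed[i].T @ normed[j]  # (p, q): cos(phi_p^(i), phi_q^(j))
    return S / (I * (I - 1))                  # average over ordered pairs
```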

We also quantitatively evaluated the consistency between the two sets of topics learned from the two cohorts. Specifically, denoting the topic matrices learned from the two cohorts as Φ(1) and Φ(2), we evaluated the consistency between the i-th topic in cohort 1 and the j-th topic in cohort 2 as the cosine similarity between their corresponding topic vectors \({\boldsymbol{\varphi}}_i^{(1)}\) and \({\boldsymbol{\varphi}}_j^{(2)}\):

$$S_{ij} = \cos \left( {\boldsymbol{\varphi}}_i^{(1)}, {\boldsymbol{\varphi}}_j^{(2)} \right),\quad i,j = 1, \cdots, K.$$

Finally, the heat map of the topic consistency matrix with Sij as its (i,j)-th entry is shown in Extended Data Fig. 5, from which we can clearly observe darker diagonal values, suggesting high consistency between the two sets of topics. In addition, the topic consistency between the positive and negative cohorts is provided in Extended Data Fig. 3c,d, which demonstrates low consistency between the two sets of topics from the positive and negative cohorts.

We calculated the entropy of each topic to quantitatively evaluate its concentration pattern. Specifically, because each topic is a probability distribution over the PASC conditions, we can use the following definition of entropy to evaluate whether a topic is close to a uniform distribution or concentrated on a few related PASC conditions:

$$H\left[ {\boldsymbol{\varphi}}_k \right] = - \mathop{\sum}\limits_{i = 1}^{D} \varphi_{ki} \times \log_2 \varphi_{ki},$$

where φki is the i-th element of φk and D = 137. In our case, H[φk] lies in the range (0, 7.0980), where 7.0980 = log2(137). A smaller H[φk] indicates that the k-th topic concentrates on particular PASC conditions (capturing a stronger co-occurrence pattern of PASC), whereas a larger H[φk] indicates that the k-th topic is closer to a uniform distribution. From Supplementary Table 6, we observed that the topics learned from the positive cohort had lower entropy values than those learned from the negative cohort.
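A one-function sketch of this entropy computation:

```python
# Sketch: per-topic entropy of the learned topic matrix.
import numpy as np

def topic_entropy(Phi, eps=1e-12):
    """Phi: (D, K) topic matrix; each column is a distribution over D = 137
    conditions. Returns H[phi_k] per topic, in (0, log2(137) ~= 7.0980)."""
    return -(Phi * np.log2(Phi + eps)).sum(axis=0)
```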

Subphenotyping through clustering

With the learned K-dimensional topic loading vector θn for the n-th patient, we applied hierarchical agglomerative clustering with Euclidean distance and the Ward linkage criterion38 to derive subphenotypes as patient clusters. To determine the optimal number of clusters (subphenotypes), following Su et al.39 and Xu et al.40, we applied the NbClust R package41, which includes 21 cluster indices to evaluate the quality of clusters. With the patients from the INSIGHT and OneFlorida+ CRNs, 13 and 12 of the 21 indices, respectively, agreed that four is the optimal number of clusters (Supplementary Table 12). Through majority voting, we set the number of clusters to four in both cohorts. Extended Data Fig. 10 shows the uniform manifold approximation and projection (UMAP) embeddings42 and dendrograms of these clusters for both cohorts.
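A minimal sketch of this clustering step (the cluster-number selection itself was done with the NbClust R package and is not reproduced here):

```python
# Sketch: subphenotyping by hierarchical agglomerative clustering on the
# topic loadings, plus a 2-D UMAP embedding for visualization.
import umap
from sklearn.cluster import AgglomerativeClustering

def derive_subphenotypes(Theta, n_clusters=4):
    """Theta: (N, K) matrix of patient topic loadings."""
    # Ward linkage in scikit-learn implies Euclidean distances.
    clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    labels = clusterer.fit_predict(Theta)
    embedding = umap.UMAP(random_state=0).fit_transform(Theta)  # for plotting
    return labels, embedding
```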

We also examined the robustness of the identified clusters in both cohorts. Specifically, for each cohort, we used the subphenotypes derived from all patients as the reference, so that the subphenotype index for the n-th patient is denoted as yn. Then, we ran 1,000 bootstrap iterations (randomly choosing 80% of the patients from the cohort, denoted as sets \({{\Omega}}_i, i = 1, \cdots, 1000\)) to learn the topic model and then derive subphenotypes with the same procedure as described above. For each bootstrapped sample set Ωi, we obtained another subphenotype index for the n-th patient, denoted \(\hat{y}_n^{(i)}\). We calculated the mean and 95% confidence interval (CI) of the adjusted Rand index (ARI) and the normalized mutual information (NMI) between the clustering results on the bootstrapped sample sets and the reference: 0.902 (95% CI: 0.863–0.927) and 0.937 (95% CI: 0.908–0.952) for the INSIGHT cohort and 0.914 (95% CI: 0.907–0.929) and 0.950 (95% CI: 0.936–0.968) for the OneFlorida+ cohort, which suggests that the identified clusters are highly robust.
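These stability scores could be computed as in the following sketch, where the per-bootstrap patient indices and cluster labels are assumed to be stored together:

```python
# Sketch: cluster stability against the reference labels via ARI and NMI.
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def stability_scores(y_ref, bootstrap_results):
    """y_ref: reference subphenotype labels for all patients;
    bootstrap_results: list of (patient_indices, labels) per bootstrap run."""
    aris = [adjusted_rand_score(y_ref[idx], y_hat)
            for idx, y_hat in bootstrap_results]
    nmis = [normalized_mutual_info_score(y_ref[idx], y_hat)
            for idx, y_hat in bootstrap_results]
    return np.mean(aris), np.mean(nmis)  # CIs taken over the bootstrap runs
```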

Comparison with patients without SARS-CoV-2 infection

We compared the co-incidence patterns of the investigative conditions in the follow-up periods of patients testing positive and negative for SARS-CoV-2 infection. The patients who tested negative for SARS-CoV-2 had only negative lab test results between March 2020 and November 2021 and no documented COVID-19-related diagnoses during this period. The index date for each patient in the non-infected group is defined as the date of his/her first (negative) lab test.

To make fair comparisons, we performed similarity matching to identify appropriate negative patients for each positive patient based on the following hypothetical confounding variables.

Demographics: age, gender, race and ethnicity, where age was binned into different groups (20–<40 years, 40–<55 years, 55–<65 years, 65–<75 years, 75–<85 years and 85+ years).

The area deprivation index (ADI; ten rank bins of the national ADI), capturing the socioeconomic disadvantage of patients’ neighborhoods18.

Index date, to consider the effect of different stages of the pandemic, binned into different time intervals (March–June 2020, July–October 2020, November 2020 to February 2021, March–June 2021 and July–November 2021).

Medical utilization, measured by the numbers of inpatient, outpatient and emergency encounters in the baseline period (binned into 0 visits, 1 or 2 visits, 3 or 4 visits and 5+ visits for each encounter type).

Coexisting conditions, including comorbidities and medications, based on a tailored list of the Elixhauser comorbidities43. We defined a patient as having a particular condition if he/she had at least two related records during the baseline period.

To identify the negative controls for each patient in a particular subphenotype, we first required an exact match on the demographic, ADI and index date confounders to obtain an initial set; we then performed robust propensity score (PS) matching on the other hypothetical confounders to rank the patients in the initial set, and we finally picked the top two. We used the standardized mean difference (SMD) to quantify the goodness-of-balance of each confounder between the two groups, \(SMD\left( x_1, x_2 \right) = \frac{\left| E\left[ x_1 \right] - E\left[ x_2 \right] \right|}{\sqrt{\left( Var\left( x_1 \right) + Var\left( x_2 \right) \right)/2}}\), where SMD < 0.2 is the threshold used to examine whether a confounder is balanced44. In both the INSIGHT and OneFlorida+ cohorts, we found that all confounders in all subphenotypes were balanced.
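A sketch of the SMD computation used for this balance check:

```python
# Sketch: standardized mean difference between matched groups for one confounder.
import numpy as np

def smd(x1, x2):
    """x1, x2: confounder values in the positive and matched negative groups.
    SMD < 0.2 is treated as balanced."""
    pooled_sd = np.sqrt((np.var(x1, ddof=1) + np.var(x2, ddof=1)) / 2.0)
    return np.abs(np.mean(x1) - np.mean(x2)) / pooled_sd
```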

Shortening the PASC condition list with high-dimensional PS adjustment

We leveraged our high-dimensional PS adjustment pipeline20,21 to identify the potential PASC conditions that showed significantly higher risk during 30–180 days after lab-confirmed SARS-CoV-2 infection compared to patients who tested negative for SARS-CoV-2, after adjusting for a comprehensive list of hypothetical confounders, including demographics, baseline medical utilization and comorbidities, and severity during the acute infection phase. More details can be found in Zang et al.21, and the results on different PASC conditions are provided in Supplementary Table 7. We selected a particular PASC condition if either of the following two conditions was satisfied.

The hazard ratio (positive over negative) was larger than 1, and the P value was smaller than 0.05/137 (Bonferroni correction over the 137 investigated conditions).

The hazard ratio was larger than 1, the P value was smaller than 0.05, and the condition was reported or suspected in PASC-related works10,22.

With these two rules, we were able to identify 44 potential PASC conditions, which are provided in Supplementary Table 2.
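The two rules amount to the following filter over the screening results in Supplementary Table 7; the column names here are hypothetical.

```python
# Sketch: applying the two selection rules to the screened conditions.
import pandas as pd

def select_pasc_conditions(df: pd.DataFrame, n_tests: int = 137) -> pd.DataFrame:
    """df columns (assumed): 'hazard_ratio', 'p_value', 'in_literature'
    (bool: reported or suspected in prior PASC-related work)."""
    rule1 = (df["hazard_ratio"] > 1) & (df["p_value"] < 0.05 / n_tests)  # Bonferroni
    rule2 = (df["hazard_ratio"] > 1) & (df["p_value"] < 0.05) & df["in_literature"]
    return df[rule1 | rule2]
```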

All statistical analyses were performed with Python 3.7, using the Python packages scikit-learn 0.23.2, numpy 1.16.5, umap-learn 0.5.1, Pydpm 3.0 and scipy 1.7.3 for the machine learning models.

Ethics approval

Use of the INSIGHT data was approved by the institutional review board (IRB) of Weill Cornell Medicine following protocol 21–10–95–380 with the title ‘Adult PCORnet-PASC Response to the Proposed Revised Milestones for the PASC EHR/ORWD Teams (RECOVER)’. Use of the OneFlorida+ data for this study was approved under University of Florida IRB number IRB202001831. All EHRs used in this study were appropriately de-identified and, thus, informed consent from patients was not obtained.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
