Screening for idiopathic pulmonary fibrosis using comorbidity signatures in electronic health records

Individual diagnostic histories can have long-term memory59, implying that the order, frequency and comorbid interactions between diseases are important for assessing the future risk of our target phenotype. We analyzed participant-specific diagnostic code sequences by first representing the medical history of each participant as a set of stochastic categorical time series (one for each group of related disorders), followed by the inference of stochastic models for these individual data streams. These inferred generators belong to a special class of HMMs, referred to as probabilistic finite-state automata (PFSA)26. The inference algorithm we used is distinct from classical HMM learning, and has important advantages in its ability to infer model structure and in its sample complexity. We inferred a separate class of models for the positive and control cohorts, after which the problem reduces to determining the probability that the short diagnostic history of a new participant arises from the positive rather than the control category of inferred models.

Data characteristics

The age-wise breakdown of the cohorts in the Truven and UCM datasets is shown in Supplementary Tables 7 and 8, respectively. The number of diagnostic codes appearing in the Truven dataset is shown in Supplementary Table 9. The Truven dataset represents a national database, and the characteristics of participants in the UCM dataset, comprising participants primarily from Cook County and adjoining areas in the state of Illinois, are somewhat different. In the Truven dataset, we considered approximately 42 million diagnostic codes (with over 46,000 unique codes) for both sexes (Supplementary Table 9) in this analysis, and identified n = 2,053,277 participants, with 53,317 in the positive group and 1,999,960 in the control group.

Step 1: partitioning the human disease spectrum

We begin by partitioning the human disease spectrum into 51 nonoverlapping broad diagnostic categories. Some of the categories that we define comprise a relatively large number of diagnostic codes aligning roughly with the categories defined within the ICD-9 framework25. The remaining categories represent groups of one or more codes that might have some known or suspected association with pulmonary disorders (Supplementary Table 3 enumerates the categories used in this study). In total, we used 17,008,378 and 25,074,255 diagnostic codes for males and females, respectively (22,685 and 23,722 unique codes), spanning both ICD-9 and ICD-10 protocols (using ICD-10 General Equivalence Mappings60 equivalents where necessary), from a total of 2,053,277 participants. Transforming the diagnostic histories to report only the broad categories reduces the number of distinct codes that the pipeline needs to handle, improving statistical power. Note that defining these categories does not preselect ‘high-risk’ phenotypes; we want our algorithm to seek out the important patterns without any manual curation of the input data.

For each participant, the past medical history is a sequence $(t_1, x_1), \cdots, (t_m, x_m)$, where $t_i$ are time stamps and $x_i$ are ICD-9 codes diagnosed at time $t_i$. We map each individual participant history to a three-letter categorical time series $z^k$ corresponding to disease category $k$, as follows. For each week $i$, we have equation (1):

$$z_i^k=\begin{cases}0 & \text{if no disease is diagnosed in week } i\\ 1 & \text{if a disease of category } k \text{ is diagnosed in week } i\\ 2 & \text{otherwise}\end{cases}$$

(1)

The time series $z^k$ is observed in the inference period. Thus, each participant is represented by 51 ternary series.
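The construction in equation (1) is straightforward to implement. The following is a minimal sketch (ours, not the authors' published code), assuming a code-to-category look-up `category_of` and a weekly-resolved history; the 104-week window matches the 2-year observation period used elsewhere in this study.

```python
from collections import defaultdict

N_CATEGORIES = 51
N_WEEKS = 104  # 2-year inference window, resolved by week

def encode_history(history, category_of, n_weeks=N_WEEKS):
    """history: list of (week_index, icd_code) pairs, week_index < n_weeks.
    Returns {category k: list of n_weeks ternary symbols z^k}, per equation (1)."""
    cats_by_week = defaultdict(set)  # categories diagnosed in each week
    for week, code in history:
        if code in category_of:
            cats_by_week[week].add(category_of[code])

    z = {k: [0] * n_weeks for k in range(N_CATEGORIES)}  # 0: no diagnosis
    for week, cats in cats_by_week.items():
        for k in range(N_CATEGORIES):
            # 1: category-k disease this week; 2: some other disease this week
            z[k][week] = 1 if k in cats else 2
    return z
```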

Step 2: model inference and computing the sequence likelihood defect Δ

The mapped series are considered to be independent sample paths, and we want to explicitly model these systems as specialized HMMs (PFSAs). The use of PFSA models, along with a measure of divergence between such models known as the SLD, has been demonstrated to achieve high predictive performance, often surpassing state-of-the-art frameworks, in applications ranging from general classification problems to incidence forecasts during the coronavirus disease 2019 pandemic and questionnaire-free tools for autism screening27,61,62. We model the positive and the control cohorts and each disease category separately, ending up with 204 HMMs at the population level (51 categories × 2 IPF status categories (positive and control) × 2 sexes). Each of these inferred models is a PFSA, a directed graph with probability-weighted edges, and acts as an optimal generator of the stochastic process driving the sequential appearance of the three letters (as defined by equation (1)) corresponding to a given disease category and IPF status type. Structurally, PFSA models are substantially more compact, with the number of model parameters up to several orders of magnitude smaller than that of state-of-the-art NN models (Supplementary Table 5). This, we believe, contributes to ZCoR-IPF's observed performance and robustness advantages over NN or deep learning models.

To reliably infer the IPF status type of a new participant, that is, the likelihood of a diagnostic sequence being generated by the corresponding IPF status-type model, we use the SLD, which generalizes the notion of Kullback–Leibler divergence63,64 between probability distributions to a divergence $\mathcal{D}(G\,||\,H)$ between ergodic stationary categorical stochastic processes65 $G, H$, given by equation (2):

$$\mathcal{D}(G\,||\,H)=\lim_{n\to\infty}\frac{1}{n}\sum_{|x|=n}p_G(x)\log\frac{p_G(x)}{p_H(x)}$$

(2)

where $|x|$ is the sequence length, and $p_G(x)$, $p_H(x)$ are the probabilities of sequence $x$ being generated by the processes $G, H$, respectively. The log-likelihood of $x$ being generated by a process $G$ is defined according to equation (3):

$$L(x,G)=-\frac{1}{|x|}\log p_G(x)$$

(3)

The cohort type for an observed sequence $x$ (which is actually generated by the hidden process $G$) can be formally inferred from observations based on the relationships given by equations (4a) and (4b)27,62:

$$\lim_{|x|\to\infty}L(x,G)=\mathcal{H}(G)$$

(4a)

$$\lim_{|x|\to\infty}L(x,H)=\mathcal{H}(G)+\mathcal{D}(G\,||\,H)$$

(4b)

where $\mathcal{H}(\cdot)$ is the entropy rate of a process63. Equation (4) shows that the computed likelihood has an additional nonnegative contribution from the divergence term when we choose the incorrect generative process. Thus, if a participant is eventually going to be diagnosed with IPF, then we expect the disease-specific mapped series corresponding to the participant's diagnostic history to be modeled by the PFSA inferred for the positive cohort. Denoting the PFSA corresponding to disease category $j$ for the positive and control cohorts as $G_+^j$ and $G_-^j$, respectively, we can compute the SLD $\Delta^j$ according to equation (5):

$$\Delta^j\triangleq L(x,G_-^j)-L(x,G_+^j)\to \mathcal{D}(G_+^j\,||\,G_-^j)$$

(5)

With the inferred PFSA models and the individual diagnostic history, we compute $\Delta^j$, which estimates the divergence on the right-hand side of equation (5). The higher this likelihood defect, the higher the similarity of the participant's diagnostic history to those of participants eventually diagnosed with IPF.
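To make equations (3) and (5) concrete, the following sketch (an illustration under our own assumptions, not the published implementation) represents a PFSA as a directed graph with deterministic symbol-labeled transitions and probability-weighted edges, and computes the SLD for one disease category; the probability floor for unseen events is an assumption.

```python
import math

class PFSA:
    """A PFSA: deterministic symbol-labeled transitions with
    probability-weighted emissions at each state."""
    def __init__(self, transitions, emissions, start=0):
        self.transitions = transitions  # transitions[state][symbol] -> next state
        self.emissions = emissions      # emissions[state][symbol]   -> probability
        self.start = start

    def neg_log_likelihood_rate(self, x):
        """L(x, G) = -(1/|x|) log p_G(x), as in equation (3)."""
        state, logp = self.start, 0.0
        for symbol in x:
            p = self.emissions[state].get(symbol, 1e-12)  # floor unseen events
            logp += math.log(p)
            state = self.transitions[state].get(symbol, state)
        return -logp / len(x)

def sld(x, pfsa_pos, pfsa_neg):
    """Delta^j = L(x, G_-^j) - L(x, G_+^j) (equation (5)); large positive
    values indicate x resembles the positive-cohort generator."""
    return pfsa_neg.neg_log_likelihood_rate(x) - pfsa_pos.neg_log_likelihood_rate(x)

# Toy usage: a 2-state positive model vs a 1-state control model over {0, 1, 2}.
G_pos = PFSA({0: {0: 0, 1: 1, 2: 0}, 1: {0: 1, 1: 1, 2: 0}},
             {0: {0: 0.7, 1: 0.2, 2: 0.1}, 1: {0: 0.3, 1: 0.5, 2: 0.2}})
G_neg = PFSA({0: {0: 0, 1: 0, 2: 0}}, {0: {0: 0.9, 1: 0.02, 2: 0.08}})
print(sld([0, 1, 1, 0, 2, 1], G_pos, G_neg))
```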

The SLD with respect to each broad category is referred to as a PFSA score. In addition to the phenotype-specific PFSA models, we used a range of engineered features that reflect various aspects of the participant-specific diagnostic histories, categorized as 'sequence features', prevalence scores ('p-scores') and 'rare scores' (see Extended Data 3 for the complete list of such features).

Prevalence scores

The p-scores focus on individual diagnostic codes: we created a dictionary of the ratio of the relative prevalence of each code (relative to the set of all codes present) in the positive category (for each sex) to its relative prevalence in the control category. This is the second hyper-training step. In the later steps of the pipeline, we used dictionary look-ups to map codes to their p-scores, and used aggregate measures of these scores, such as the mean, median and variance, to train a downstream Light Gradient Boosting Machine (LGBM).
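A minimal sketch of this construction (the function names and the smoothing constant are our illustrative assumptions):

```python
from collections import Counter
import numpy as np

def pscore_dictionary(positive_codes, control_codes, eps=1e-9):
    """positive_codes / control_codes: all code occurrences in each cohort.
    Returns {code: relative prevalence in positive / relative prevalence in control}."""
    pos, ctl = Counter(positive_codes), Counter(control_codes)
    n_pos, n_ctl = sum(pos.values()), sum(ctl.values())
    return {c: (pos[c] / n_pos + eps) / (ctl[c] / n_ctl + eps)
            for c in set(pos) | set(ctl)}

def pscore_features(participant_codes, pdict):
    """Aggregate p-score measures (mean, median, variance) for one participant."""
    s = np.array([pdict[c] for c in participant_codes if c in pdict])
    if s.size == 0:
        return {"mean": 0.0, "median": 0.0, "var": 0.0}
    return {"mean": s.mean(), "median": float(np.median(s)), "var": s.var()}
```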

Rare scores

These scores consist of a subset of p-scores that correspond to codes with particularly high and low relative prevalences (p-score > 2 or < 0.5). Thus, this feature category depends on the p-score dictionary generated in the second hyper-training step.

Sequence scores

Sequence scores are relatively straightforward statistical measures, such as the mean, median, variance and time since last occurrence, computed on the ternary phenotype-specific sex-stratified histories. No hyper-training is required for the generation of the sequence features.

Data splits: training and validation hold-out

To learn the complete set of 667 features, we required three splits of the training dataset. First, all eligible participants were randomly split into the training set (≈75% of data) and the test set (≈25% of data). The training set was then split into three subsets: (1) the hyper-training set (Supplementary Fig. 2) was used to train the PFSA models and the p-score dictionary; (2) the second split (referred to as the pre-aggregation split; Supplementary Fig. 2) was used to train the four feature-category-specific LGBMs; and (3) the final split (referred to as the aggregation split; Supplementary Fig. 2) was used to train the aggregate LGBM, which uses the outputs of the trained LGBMs in the previous layer (see the sketch below). This trained pipeline was then validated on the held-out test split (≈25% of data). Figure 2c shows the top 20 features ranked in order of their relative importance (the relative loss in performance when dropped out of the analysis).
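Schematically, the split-and-stack logic can be sketched as follows (a simplified illustration: the split fractions, hyperparameters and feature-matrix interface are our assumptions, not the published configuration):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

def train_pipeline(X_by_category, y):
    """X_by_category: {feature category: matrix}; y: np.array of labels."""
    idx = np.arange(len(y))
    hyper, rest = train_test_split(idx, test_size=0.5, stratify=y)
    pre_agg, agg = train_test_split(rest, test_size=0.5, stratify=y[rest])
    # `hyper` would be used upstream to fit the PFSA models and the
    # p-score dictionary that generate X_by_category in the first place.

    # Layer 1: one LGBM per feature category, fit on the pre-aggregation split.
    layer1 = {name: LGBMClassifier(n_estimators=200).fit(X[pre_agg], y[pre_agg])
              for name, X in X_by_category.items()}

    # Layer 2: the aggregate LGBM consumes the layer-1 scores.
    Z = np.column_stack([m.predict_proba(X_by_category[name][agg])[:, 1]
                         for name, m in layer1.items()])
    final = LGBMClassifier(n_estimators=200).fit(Z, y[agg])
    return layer1, final
```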

Step 3: risk estimation pipeline with semi-supervised and supervised learning modules

The SLD, along with a range of other engineered features, all functions of data available at the point of care (discussed below), was used to train a network of standard gradient boosting classifiers66 that computes the ZCoR-IPF score.

Statistical analysis

We estimated 95% CIs for all reported metrics. The CIs for the AUC were calculated using the well-known equivalence of the AUC with the Mann–Whitney U statistic67,68,69. We also carried out bootstrapped runs over randomly selected sub-cohorts to estimate the distribution of the AUC over these runs; the mean AUCs obtained by this approach were within ±1% of the U-statistic estimate for sufficiently large sub-cohorts. The CIs for specificity and sensitivity were computed via the asymptotic method for single proportions70 (also known as the Wald method). CIs for the remaining metrics were computed from the extremal values of the CIs for specificity and sensitivity. We also computed P values for the null hypothesis of no performance difference between ZCoR-IPF and the corresponding baseline or NN model for the same sex and dataset. The performance difference in the case where the NN model ALEXNET marginally outperformed ZCoR-IPF was not significant; in almost all other cases (the only exception being DENSENET for males applied to the Truven dataset), ZCoR-IPF outperformed the corresponding model significantly (Supplementary Table 10). These P values were computed using Dantzig's upper bound on the AUC variance71,72.

Confidence bounds on sensitivity, specificity, positive and negative predictive values and likelihood ratios

The simple asymptotic/Wald method70,73 without continuity correction is sufficient here because the number of samples is large. Thus, for the CI of a proportion of interest $p$ (which in this context can be either specificity or sensitivity) estimated from $n$ samples, we used equation (6):

$$p\pm z\sqrt{\frac{p(1-p)}{n}}$$

(6)

where z is the 1 − α/2 point of the standard normal distribution. Note that Newcombe has collated at least seven different approaches to two-sided CIs for quantities that are essentially proportions (such as sensitivity, which is the ratio of true positives to all actual positives, and specificity, which is the ratio of true negatives to all actual negatives), including the Wilson score, continuity correction to the asymptotic estimates, methods using exact binomial tail areas and likelihood-based approaches73. The Wald method is the simplest of these; the more sophisticated approaches are called for when the normality assumption cannot be relied upon due to small sample sizes. We verified that these more complex approaches add only minute and insignificant corrections in our case.

Thus, since sensitivity is a proportion computed over the participants in the positive category, given the empirically estimated sensitivity s, its CIs are computed according to equation (7):

$$s\pm z\sqrt{\frac{s(1-s)}{n_{\mathrm{positive}}}}$$

(7)

where $n_{\mathrm{positive}}$ is the number of participants in the positive category and, as before, z is the 1 − α/2 point of the standard normal distribution.

Similarly, given the empirically estimated specificity c, its CIs are computed according to equation (8):

$$c\pm z\sqrt{\frac{c(1-c)}{n_{\mathrm{control}}}}$$

(8)

where $n_{\mathrm{control}}$ is the number of participants in the control category. Once the confidence bounds on sensitivity and specificity are determined, the confidence bounds on the remaining quantities are determined using the extremal values of sensitivity and specificity.
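For concreteness, a sketch of equations (7) and (8) (scipy is used for the normal quantile; clipping the interval to [0, 1] is our addition):

```python
from math import sqrt
from scipy.stats import norm

def wald_ci(p, n, alpha=0.05):
    """Asymptotic (Wald) CI for a proportion p estimated from n samples."""
    z = norm.ppf(1 - alpha / 2)       # the 1 - alpha/2 normal quantile
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Example: sensitivity s = 0.80 over n_positive = 1,000; specificity c = 0.95
# over n_control = 50,000 (illustrative numbers, not results from the paper).
sens_ci = wald_ci(0.80, 1_000)    # equation (7)
spec_ci = wald_ci(0.95, 50_000)   # equation (8)
```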

Confidence bounds on the area under the curve

The CIs on the estimated AUC values are calculated from the equivalence of the area under the ROC curve with the Mann–Whitney U statistic67,68,69,71,74. Let $X$ and $Y$ be independent random variables, and let $x_1, x_2, \cdots, x_m$ and $y_1, y_2, \cdots, y_n$ be samples of $X$ and $Y$, respectively. The U statistic is defined by equation (9):

$$U=\text{number of pairs } (x_i, y_j) \text{ such that } x_i < y_j$$

(9)

It is known67 that the probability given by equation (10):

$$\rho\triangleq \Pr\{X<Y\}$$

(10)

can be estimated by the statistic given by equation (11):

$$\hat{\rho}=\frac{U}{mn}$$

(11)

It is also easily shown that ρ is identical to the AUC and that, under realistic assumptions, we have the bound given by equation (12)71:

$$\sigma^2(\hat{\rho})\leqq \frac{\rho(1-\rho)}{\min(m,n)}$$

(12)

and that $\hat{\rho}$ is an unbiased and consistent estimate of ρ. These results allow us to estimate CIs for AUC values at any given significance α by computing those intervals for $\hat{\rho}=U/(mn)$, essentially either by using the variance upper bound given above or via more sophisticated reasoning when some of the assumptions behind equation (12) are not satisfied67,71. In this study, we used the CIs for the U statistic computed by the scipy statistical toolbox for Python 3.x.
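The following sketch illustrates one way to assemble such a CI from equations (9) to (12); the normal approximation for $\hat{\rho}$ and the use of the variance upper bound are our assumptions here, and scipy is used only to compute the U statistic.

```python
import numpy as np
from scipy.stats import mannwhitneyu, norm

def auc_ci(scores_pos, scores_neg, alpha=0.05):
    """AUC point estimate and CI via the Mann-Whitney equivalence."""
    n, m = len(scores_pos), len(scores_neg)
    U = mannwhitneyu(scores_pos, scores_neg, alternative="two-sided").statistic
    rho = U / (m * n)                          # equation (11): rho-hat = AUC
    var_bound = rho * (1 - rho) / min(m, n)    # equation (12): variance bound
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_bound)
    return rho, (max(0.0, rho - half), min(1.0, rho + half))
```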

Approach to comparison with the baseline model

We investigated whether ZCoR-IPF outperforms standard or classical approaches, which typically have limited or no pattern-discovery capability. To that effect, we trained and validated a baseline model that encodes a broad set of fixed risk factors as binary features (recording their presence or absence in a participant's history within the past 2 years) and uses standard logistic regression to train a classifier predicting an IPF diagnosis (the appearance of one of the target codes) 1 year in the future. The 87 diagnostic codes defining the binary features used in this baseline model are enumerated in Supplementary Table 4; they relate to major asthma- and COPD-associated codes, cough, dyspnea and other major pulmonary and cardiopulmonary complications. The baseline model also uses age as a feature (recording whether participants are 65 years or older).
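A compact sketch of this baseline (`BASELINE_CODES` stands in for the 87 codes of Supplementary Table 4, and the solver settings are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def baseline_features(history_codes, baseline_codes, age):
    """history_codes: set of ICD codes recorded in the past 2 years."""
    x = [1 if code in history_codes else 0 for code in baseline_codes]
    x.append(1 if age >= 65 else 0)  # age flag used by the baseline
    return np.array(x)

# X = np.stack([baseline_features(h, BASELINE_CODES, a) for h, a in cohort])
# y: 1 if a target IPF code appears within 1 year of the screening date
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```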

Approach to comparing with neural network architectures

To adequately compare the performance of NNs to that of our model, we used the same input data for each participant as described in 'ZCoR-IPF modeling, training and prediction'. To infer the risk score, we looked at the 2 years of diagnostic records preceding the date of screening. As a result, each participant is represented by 51 sparse stochastic 104-week-long time series of events corresponding to the 51 disease groups, with one ternary digit per week: '1' if a disease from the given disease group was diagnosed that week, '2' if any other disease was diagnosed that week and '0' if no diseases were diagnosed that week. If no codes were recorded for a certain time series for the whole window of observation, all of its digits were marked as '−1'. For each participant, the list of 51 104-week-long time series is transformed into a 51 × 104 matrix, where the time-series rows appear in the same order as they appear in the inputs for training ZCoR-IPF (Supplementary Fig. 3).
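This encoding reuses the ternary mapping of equation (1); a minimal sketch of the matrix construction (the −1 masking convention follows the description above):

```python
import numpy as np

def nn_input_matrix(z, n_categories=51, n_weeks=104):
    """z: {category k: list of 104 ternary digits}, e.g. from encode_history."""
    M = np.zeros((n_categories, n_weeks), dtype=np.int8)
    for k in range(n_categories):
        row = np.array(z[k], dtype=np.int8)
        M[k] = -1 if not row.any() else row  # -1 marks an all-empty series
    return M
```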

We investigated the predictive performance achieved by diverse NN and standard deep learning architectures, ranging from simple feed-forward networks75 to long short-term memory (LSTM) models76 and convolutional neural networks (CNNs)77, to more advanced ALEXNET78,79, DENSENET80, RESNET81, RETAIN82 and multichannel convolutional architectures (Supplementary Note). LSTM networks were chosen to investigate the ability of these architectures to model long-range longitudinal dependencies. CNNs were chosen to test their ability to leverage patterns and interdependencies across concurrent time series of different disease groups. The more standard architectures were investigated to evaluate how well previously successful NN models perform in this context. The RETAIN architecture was investigated for its claimed optimized yet interpretable structure specifically targeting the health domain. As with the ZCoR-IPF approach, we trained and tested separate models for the male and female cohorts.

We trained these networks to deliver the maximum out-of-sample performance on the Truven dataset (with similar test/train data splits as used for ZCoR-IPF) and recorded their performance on the UCM validation dataset.

Simpler architectures did not produce good results, irrespective of whether we used LSTMs, CNNs or modifications thereof. Multichannel architectures, for example, CNN_MULTICHANNEL, were also investigated to explore whether performance enhancement is obtained from using 51 separate 104-digit-long one-dimensional arrays feeding 51 parallel sets of convolutional layers (instead of a 51 × 104 two-dimensional matrix as input), which are concatenated for the final set of dense layers that provide the prediction (see the sketch below). A summary of the different NN models we investigated is shown in Supplementary Table 5.
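A schematic PyTorch sketch of the multichannel idea (layer widths, kernel sizes and the pooling scheme are illustrative assumptions, not the architecture of Supplementary Table 5):

```python
import torch
import torch.nn as nn

class MultiChannelCNN(nn.Module):
    def __init__(self, n_channels=51, n_weeks=104):
        super().__init__()
        # One small 1D-conv stack per disease-group channel.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1, 8, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool1d(2), nn.Flatten())
            for _ in range(n_channels)])
        # Concatenated branch outputs feed the final dense layers.
        self.head = nn.Sequential(
            nn.Linear(n_channels * 8 * (n_weeks // 2), 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, x):  # x: (batch, 51, 104) float tensor
        feats = [branch(x[:, k:k + 1, :]) for k, branch in enumerate(self.branches)]
        return torch.sigmoid(self.head(torch.cat(feats, dim=1)))
```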

Our results suggest that simple NN architectures perform substantially worse than ZCoR-IPF, but are generally better than the baseline model (Extended Data Fig. 2), while typically having larger model complexity. Exploring state-of-the-art deep NN models (Supplementary Table 5), we were able to train and validate the ALEXNET architecture to outperform ZCoR-IPF on out-of-sample data in the Truven dataset (ALEXNET, 94% vs ZCoR-IPF, 89% AUC). However, the subsequent performance of the trained ALEXNET on the UCM dataset was substantially worse (ALEXNET, 69% vs ZCoR-IPF, 93% AUC). At the same time, the ALEXNET model, with 22,616,129 parameters, was more than two orders of magnitude more complex than ZCoR-IPF, with 64,091 parameters. We calculated P values for the significance of ZCoR-IPF's outperformance (Supplementary Table 10) and show that (1) ZCoR-IPF significantly outperforms the competing models, and (2) the case where ALEXNET marginally beats ZCoR-IPF is not statistically significant.

We concluded that some NNs (in our case, ALEXNET) can achieve high out-of-sample performance in the Truven database and even outperform ZCoR-IPF there; however, the performance of the trained NN pipeline can be significantly worse in an independent database, while ZCoR-IPF maintains high performance across databases. This pattern held for all NN models we experimented with. Note that this is not a simple case of overtraining, which was diligently accounted for in the training phase. Thus, it appears that the differences in participant characteristics between Truven and UCM (Table 2) have a dramatic impact on NN performance, while not affecting ZCoR-IPF significantly. It is possible that some specific encoding of the data can remedy this issue for NN and deep learning models.

Ablation studies and approach to establishing the necessity of longitudinal patterns

Among the multiple feature categories, p-scores are designed to capture the impact of individual codes by computing the risk ratio of specific ICD codes in the positive cohorts compared with the control cohorts. The PFSA scores, on the other hand, are designed to capture longitudinal patterns within and across the broad phenotypic categories described above. It is important to ask whether the added complexity of recognizing longitudinal patterns is at all necessary, that is, whether the patterns that modulate risk have any longitudinal dependencies. We investigated this question in two ways. First, we performed ablation studies, training pipelines using only PFSA-based features and no p-scores on one hand, and only p-score-based features and no PFSA-based ones (thus eliminating longitudinal patterns) on the other. Second, we randomly permuted the encounter time stamps for individual participants in the testing and validation stages and evaluated whether such shuffled histories change the prediction decisions made by the ZCoR-IPF pipeline (see the sketch below).
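The permutation interrogation can be sketched as follows (a simplified illustration: `predict` stands in for the trained ZCoR-IPF pipeline applied to one participant's history):

```python
import random

def permute_history(history, rng=random):
    """history: list of (week, icd_code); shuffle the weeks across encounters,
    erasing longitudinal order while preserving code content."""
    weeks = [w for w, _ in history]
    rng.shuffle(weeks)
    return [(w, code) for w, (_, code) in zip(weeks, history)]

def decision_flip_rate(cohort, predict, threshold):
    """Fraction of participants whose thresholded risk call changes after
    their encounter time stamps are randomly permuted."""
    flips = sum(
        (predict(h) >= threshold) != (predict(permute_history(h)) >= threshold)
        for h in cohort)
    return flips / len(cohort)
```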

These interrogations show that longitudinal patterns inferred by ZCoR-IPF are key to maintaining high performance under noise and uncertainties (Supplementary Fig. 1). Our ablation studies on the different feature categories used by ZCoR-IPF show that the p-score by itself (without the PFSA-based features, and therefore having no longitudinal pattern discovery capability) does better in some scenarios than when using PFSA-based features alone. Nevertheless, the combined models perform better than either, suggesting that longitudinal patterns have quantifiable predictive value. Additionally, when we perturb participant histories by randomly swapping diagnostic codes with one representing similar or confounding disorders (Supplementary Table 11), p-scores alone have quite dismal performance, and the performance is recovered when we use PFSA-based features in combination with p-scores (Supplementary Fig. 1). This trend is replicated with the full cohort in the Truven dataset, as well as the high-risk and the low-risk sub-cohorts. We also evaluated the scenario where longitudinal patterns existing in the data are erased by randomly permuting the recorded time stamps in individual participant histories. When the out-of-sample validation cohort in the Truven dataset was permuted in this manner, we observed (Supplementary Fig. 1) that at 95% specificity, ZCoR-IPF decisions switched in 10% of the participants who were predicted to be in the positive cohort with original unperturbed histories. At 90% specificity, this fraction increases to 15%. Similarly, predictions switched from the control to the positive category in 0.5% of the cases at 95% specificity, and in 0.75% of the cases at 90% specificity. These investigations demonstrate that longitudinal patterns are indeed predictive of future IPF risk.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
