DNA methylation clocks struggle to distinguish inflammaging from healthy aging, but feature rectification improves coherence and enhances detection of inflammaging

Analysis of elastic net CpG selection

The elastic net performs feature selection via the L1 norm and minimizes its cost function over the training data [2]. However, in biology, the focus is not the numerical value of a minimized cost function (e.g., − 42.42), but rather the specific CpG biomarkers selected and their age-related changes that influence gene expression. Therefore, we asked three aging-relevant questions: 1) whether a CpG probe’s correlation with chronological age is related to its importance in the model, 2) whether EN selects CpGs based on increased, age-related dysregulation, and 3) whether the elastic net selects CpGs that are associated with inflammaging.

These analyses were applied to all major current clocks: the first-generation Horvath multi-tissue and Hannum blood age predictors and the next-generation predictors PhenoAge, AdaptAge, DamAge, and CausAge. Additionally, we analyzed the CpG probe selection for DunedinPACE, which is a putative measurement of the rate of biological aging. For our analysis we created a composite dataset (GSE42861, GSE125105, GSE72774, GSE106648) from studies in which DNA methylation was probed in either whole blood or PBMCs from healthy controls and from patients suffering from diseases associated with inflammaging, such as rheumatoid arthritis, Parkinson’s disease, depression, and multiple sclerosis [24,25,26,27,28,29,30]. We define inflammaging as encompassing any disease linked to persistent inflammation within one or more tissues, the risk of which increases with aging.

First, we tested for correlation between a CpG’s importance in a model and the change in its beta values with age. Normalized feature importances were calculated for each CpG of each DNAm clock (methods), with an importance near zero indicating low importance, and a value of one indicating the most influential CpG in the model. For each model, we regressed the beta values of each CpG individually on the chronological ages of the healthy controls and plotted its R-squared value against its relative importance (Fig. 1a). Interestingly, in contradiction to the notion of predicting aging from the changes in the methylome, the most important CpGs were, in all models, those unchanged or changed the least in their methylation with age. Given that there are probes for which |r|> 0.8 for their age correlation (e.g., cg16867657), these results suggest that correlation of CpG methylation with age is not the primary criterion for its importance in a model, otherwise such CpGs would tend to be over-represented in EN models, and with higher importances than CpGs with lower age correlations.
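To make this first analysis concrete, a minimal sketch (in Python) of the per-CpG age regression and a normalized importance score is shown below. The exact importance definition is given in the methods; the |coefficient| × SD scaling used here is therefore an assumption, and all names are illustrative.

```python
import numpy as np
from scipy import stats

def age_r2_per_cpg(betas, ages):
    """R-squared of regressing each CpG's beta values on chronological age.
    betas: (n_samples, n_cpgs) array; ages: (n_samples,) array."""
    r2 = np.empty(betas.shape[1])
    for j in range(betas.shape[1]):
        r, _ = stats.pearsonr(ages, betas[:, j])
        r2[j] = r ** 2  # for a simple linear regression, R^2 = r^2
    return r2

def normalized_importance(weights, betas):
    """One plausible importance score (assumed, not the paper's exact definition):
    |model coefficient| scaled by the feature's standard deviation,
    normalized so that the most influential CpG scores 1."""
    raw = np.abs(weights) * betas.std(axis=0)
    return raw / raw.max()
```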

Fig. 1

In all tested models, the importance of a clock’s CpGs does not positively correlate with the changes in DNAm. a R-squared of the age correlation of a feature vs. its normalized importance in the respective model. Color-coded dashed lines represent the mean R-squared for a model’s features’ age correlations. The probes with the highest importance for all models had |r| < 0.45. b R-squared of the residuals of a feature’s age correlation vs. its normalized importance in the respective model. c Negative log of a feature’s p-value (H0: no difference in the distributions of beta values between controls and samples with inflammaging) vs. its normalized importance in the respective model

Next, following up on our discovery that DNAm noise is a biomarker of aging [23], we explored the relationship between a selected CpG’s increase in variance with age and its relative importance (Fig. 1b). To measure age-dependent variance, we took the residuals from the age correlations for each CpG and regressed them on chronological age. We then plotted the R-squared value for this regression against the CpG’s relative importance. Once again, we found a trend of decreasing correlation between age-specific CpG dysregulation and its importance in the EN model: features with the highest noise (|r|≤ 0.4) had the lowest importance, while the CpGs with the highest importance for each model tended to have low noise (|r|≤ 0.24). This suggests that age-related noise is not the primary criterion for feature importance in EN clocks.
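A sketch of this noise metric, assuming the magnitude of the per-CpG residuals is regressed on age (raw residuals from the same fit are orthogonal to age by construction, so a transformation such as the absolute value is implied; the paper’s exact procedure is in its methods):

```python
import numpy as np
from scipy import stats

def age_dependent_noise_r2(beta, ages):
    """Per-CpG 'noise' metric (sketch): fit beta ~ age, then regress the
    magnitude of the residuals on age and return the R-squared.
    Using |residuals| is an assumption made for this illustration."""
    slope, intercept, _, _, _ = stats.linregress(ages, beta)
    residuals = beta - (slope * ages + intercept)
    _, _, r_noise, _, _ = stats.linregress(ages, np.abs(residuals))
    return r_noise ** 2
```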

Next, we explored the possibility that the elastic net selects CpGs which are significantly different between healthy people and patients with diseases associated with inflammaging (Fig. 1c). To do this, we performed a two-tailed Mann–Whitney U test (methods) on the beta value distributions of each CpG probe between healthy controls and patients. We then plotted the negative log of the p-value against the relative importance of each CpG in its respective model. Higher negative log values indicate larger differences between patients and healthy controls for that CpG probe. Again, as with the lack of CpG relevance to age-specific changes in values or dysregulation, there was a general trend of sharply increasing EN-assigned importance of CpGs that differ the least between healthy people and patients, indicating that the elastic net does not select CpGs based on their differential methylation due to inflammaging.
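The per-probe comparison between cohorts can be sketched as follows (two-tailed Mann–Whitney U test per CpG, as described above; array shapes and names are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def neg_log_p_per_cpg(betas_hc, betas_ic):
    """-log10(p) of a two-tailed Mann-Whitney U test per CpG between the
    healthy-control (HC) and inflammaging-cohort (IC) beta value distributions."""
    neg_log_p = np.empty(betas_hc.shape[1])
    for j in range(betas_hc.shape[1]):
        _, p = mannwhitneyu(betas_hc[:, j], betas_ic[:, j],
                            alternative="two-sided")
        neg_log_p[j] = -np.log10(p)
    return neg_log_p
```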

Thus, all examined clocks not only fail to weight DNAm biomarkers of aging by significance but also appear to preferentially exclude such biomarkers. The exclusion of biologically relevant CpGs results from the combination of the L1 penalty for feature selection and regularization, as well as minimizing the EN cost function. This function intentionally penalizes CpGs with little-to-no overall individual correlation with age or its proxies, even if they may correlate differently between various training cohorts (e.g., healthy control versus patient samples), by reducing their weights to zero (Box 1). The larger the signal shift between two cohorts in the training data at a CpG probe, the more likely EN is to assign a lower weight to that CpG. However, CpG sites with high degrees of differential methylation are often of the greatest biological interest, as they may indicate key differences related to inflammaging and, consequently, biological age acceleration.

Box 1 Demonstration: The elastic net may select against differentially methylated CpGs within the training set

Quantifying normal variance in DNAm clock residuals

One method for predicting biological age with DNAm clocks is to use the residual, or difference between a sample’s predicted age and its known chronological age, with deviations suggested to indicate age acceleration [1, 3,4,5, 7,8,9,10,11,12]. However, residuals would be expected to include not only biological age acceleration, but also natural variation in the healthy human methylome, random experimental variation introduced during the assays and data processing, and variance due to approximating a non-linear process with a linear model.

To quantify normal variation and determine thresholds for potentially aberrant methylomes, we fit a probability distribution to the residuals of predicted ages for healthy controls from the composite dataset (GSE42861, GSE125105, GSE72774, GSE106648) for each predictive model. The residuals were approximately normally distributed according to Q-Q plots (Supplementary Fig. 2a). Thus, a normal distribution was fit to the residuals for each model (Fig. 2a). We also investigated the residual distributions at 10-year age intervals for each model (Supplementary Fig. 2b).
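In code, this step amounts to a Q-Q check and a maximum-likelihood fit of a normal distribution; the sketch below uses placeholder residuals rather than the actual clock outputs.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Placeholder residuals; in practice these are predicted minus chronological
# ages for the healthy controls of a given clock.
rng = np.random.default_rng(0)
residuals = rng.normal(loc=1.0, scale=5.0, size=500)

stats.probplot(residuals, dist="norm", plot=plt)  # Q-Q plot vs. a normal
mu, sigma = stats.norm.fit(residuals)             # MLE fit of a normal distribution
```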

Fig. 2

Normal variance in predictions of biological age for healthy people is much larger than the residuals of the predictions for the inflammaging samples. a Fitted normal distribution for the residuals of first- and next-generation clocks on healthy controls in the composite dataset. b Fitted normal distributions for the residuals of clocks on the inflammaging cohort in the composite dataset. c KDE plots with histograms normalized by cohort size comparing the distributions of residuals for the healthy control cohort (blue) and inflammaging cohort (red) in the composite dataset for first- and next-generation clocks (d = Cohen’s d). d DunedinPACE score calculated for the healthy control and inflammaging cohorts vs. chronological age. Blue and red dashed best-fit lines are for the healthy and inflammaging cohorts, respectively. e Normalized KDE plots and histograms comparing the distributions of PACE scores for the healthy and inflammaging cohorts. Corresponding dashed lines represent cohort means

Next, we calculated the 95% interpercentile range (95-IPR, i.e., the range containing the central 95% of residuals) for the healthy cohort (HC) for each model. Additionally, we calculated the means of the residuals of the healthy cohort and inflammaging cohort (IC) to quantify how well-calibrated each model is to test data. We define good calibration as having a residual mean close to zero, indicating there is no bias for having more positive or more negative residuals. These results, summarized in Table 1, demonstrate that, when correcting for the mean shift, residuals of up to 10.8 years fall within the expected range for healthy individuals for the model with the smallest 95-IPR and thus the highest precision (Hannum). Averaging over all models tested, a residual of over 24.6 years would be needed to be outside the expected range for healthy individuals, when correcting for mean shifts. The mean shifts for the inflammaging cohort for all models are less than this threshold by an order of magnitude, suggesting that all models lack sufficient power to detect the age-accelerating effects of inflammaging (Table 1, Fig. 2b).
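For reference, the 95-IPR and the calibration check (mean residual) reduce to a few lines; this sketch uses the central 2.5th–97.5th percentile range, matching the definition above.

```python
import numpy as np

def residual_summary(predicted_ages, chronological_ages):
    """95% interpercentile range and mean of the residuals for one cohort."""
    residuals = np.asarray(predicted_ages) - np.asarray(chronological_ages)
    lo, hi = np.percentile(residuals, [2.5, 97.5])
    return {"95-IPR": hi - lo, "mean_residual": residuals.mean()}
```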

Table 1 Residual distribution and calibration of DNAm clocks

We performed a two-tailed Welch’s t-test to determine statistical significance between the distributions of residuals for the healthy and inflammaging cohorts (Fig. 2c). Importantly, high numbers of samples can yield artificially low p-values, so we also quantified effect size, which is more indicative of the power of a model. To this end, we used Cohen’s d, which is a signal-to-noise ratio, and applied the effect size rule of thumb suggested by Sawilowsky (2009) [32]. Of the models, only PhenoAge and AdaptAge had statistically higher residual means for patients (p < 0.05), but both have small effect sizes. Moreover, the higher mean shifts for the inflammaging cohort were well within the 95-IPR for both models. These results suggest that all the models investigated lack sufficient power to detect the age-accelerating effects of inflammaging via residuals.
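A minimal sketch of this cohort comparison, using SciPy’s Welch’s t-test and the pooled-SD form of Cohen’s d (the exact Cohen’s d variant used is specified in the methods, so this form is an assumption):

```python
import numpy as np
from scipy.stats import ttest_ind

def welch_and_cohens_d(resid_hc, resid_ic):
    """Two-tailed Welch's t-test between cohort residuals plus Cohen's d."""
    resid_hc, resid_ic = np.asarray(resid_hc), np.asarray(resid_ic)
    t, p = ttest_ind(resid_ic, resid_hc, equal_var=False)  # Welch's t-test
    n1, n2 = len(resid_ic), len(resid_hc)
    pooled_sd = np.sqrt(((n1 - 1) * resid_ic.var(ddof=1) +
                         (n2 - 1) * resid_hc.var(ddof=1)) / (n1 + n2 - 2))
    d = (resid_ic.mean() - resid_hc.mean()) / pooled_sd
    return t, p, d
```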

Next, we analyzed the DunedinPACE model separately, as it yields a score rather than a predicted age, with a higher score indicating a higher rate, or pace, of aging [6]. We first generated a PACE score (methods) for each sample in both the healthy and patient cohorts of the composite dataset and plotted the PACE score versus chronological age for both (Fig. 2d). PACE was trained on samples from people who were all ∼53 years old, yet interestingly, it worked as an age predictor when tested on the combined dataset spanning 20 to 90+ years. This observation demonstrates that predictors trained on proxies of chronological age (e.g., disease incidence, which increases with aging) continue to predict age.

Next, to determine whether PACE detects inflammaging, we performed a two-tailed Welch’s t-test comparing the means of the distributions between the cohorts and calculated the effect size (Fig. 2e). While the mean of the inflammaging population was slightly shifted upwards (d = 0.38), this is a small effect size for inflammaging, a hallmark of aging that should be robustly discernible.

In summary, most models failed to distinguish health from inflammaging, and even PACE, which was trained on extreme differences between people who were healthy vs. those who had heart attacks, strokes, etc., had only marginal power to detect the age-accelerating effects of inflammaging. Overall, there remains room for improvement in feature selection.

OLS regression suffices for a generalizable age predictor with hypothesis-driven feature selection

DNAm clocks are typically trained on DNA methylation array datasets that include tens to hundreds of thousands of CpG features (p) and hundreds to a few thousand samples (n). Ordinary least squares regression has the advantage of good interpretability of the correlations between the response and predictor variables, but on data of this structure, where p >> n, OLS tends to overfit severely and exhibits very high variance on out-of-sample data [2]. Elastic net regression models combat this issue via regularization, which performs feature selection and introduces bias to the model, effectively reducing variance on out-of-sample data. This makes EN more generalizable than ordinary least squares models. However, as demonstrated, EN clock models suffer from a lack of biological interpretability and may select against features with the most biological significance.

We hypothesized that, due to the presence of many CpGs with high linear age correlations, it is possible to construct a generalizable and interpretable simple ordinary least squares age predictor from a reduced set of CpGs that have a priori biological relevance. To test this hypothesis, we implemented a modified version of the forward stepwise selection (mFSS) algorithm (Fig. 3a, Algorithm 1).

Fig. 3

Hypothesis-driven mFSS OLS model. a Schema of the conventional forward stepwise selection and the modified forward stepwise selection algorithms. b (Top) Pearson correlation vs. number of features included in the model. (Bottom) Mean squared error vs. number of features included in the model. Vertical dashed red line indicates the optimum number of features. c The optimum mFSS model predictions on the training set (GSE40279). Black dashed line is y = x. d mFSS model predictions on the test set (GSE55763). e mFSS model predictions on out-of-sample test data with rheumatoid arthritis patients and healthy controls (GSE42861). f mFSS model predictions on out-of-sample test data with depression patients and healthy controls (GSE125105). g mFSS model predictions on out-of-sample test data with multiple sclerosis patients and healthy controls (GSE106648). h mFSS model predictions on out-of-sample test data with Parkinson’s disease patients and healthy controls (GSE72774)

Algorithm 1

modified Forward Stepwise Selection (mFSS)

Briefly, we begin with a list of features ranked by some biological criterion (e.g., correlation with age, increase in noise with age). An OLS regression model is trained as each feature is sequentially added, recording the value of a cost function (e.g., mean squared error) on a test set at each iteration. If the addition of a feature achieves a new low for the cost function, we record that model as the best model. We employ a patience hyperparameter P, which allows for up to P unproductive feature additions (i.e., additions that do not produce a model with a new minimum cost) before stopping and returning the last best model with its optimal feature selection. The algorithm stops either after P unproductive additions or after all features in the initial list have been added.
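A minimal sketch of mFSS as described above, using scikit-learn OLS and validation-set MSE as the cost function; variable names are illustrative, and the exact implementation is given in Algorithm 1 and the methods.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def mfss_ols(X_train, y_train, X_val, y_val, ranked_idx, patience=50):
    """Modified forward stepwise selection (sketch). ranked_idx holds feature
    indices pre-ranked by a biological criterion (e.g., |age correlation|).
    Features are added one at a time and the model with the lowest validation
    MSE is kept; the counter of unproductive additions resets on improvement
    (an implementation detail assumed here)."""
    best_mse, best_k, best_model = np.inf, 0, None
    unproductive = 0
    for k in range(1, len(ranked_idx) + 1):
        cols = ranked_idx[:k]
        model = LinearRegression().fit(X_train[:, cols], y_train)
        mse = mean_squared_error(y_val, model.predict(X_val[:, cols]))
        if mse < best_mse:
            best_mse, best_k, best_model = mse, k, model
            unproductive = 0
        else:
            unproductive += 1
            if unproductive >= patience:
                break
    return best_model, ranked_idx[:best_k], best_mse
```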

We generated a ranked list of features from a 450K DNAm array dataset (GSE40279, N = 656) by regressing chronological age on each CpG feature independently, ranking the CpGs by the magnitude of their age correlation. This same dataset was used for training, and we used an out-of-sample dataset (GSE55763, N = 2631) as a test set. We next trained an mFSS OLS model, splitting the training data (85/15) into training and validation sets (Fig. 3a, b). The selected features and weights for this model are in Supplementary Excel 1. The top-ranked CpG alone yields a model with r > 0.8, and the top 74 CpGs yielded the best OLS model with r = 0.9 and mean absolute error (MAE) of 3.7 years in the test set (Fig. 3b–d). The mFSS clock displays high accuracy and low variance on out-of-sample data, demonstrating that it is generalizable and does not suffer from overfitting (Fig. 3e–h). Importantly, the mFSS clock trained on CpGs with the highest age correlation is well-calibrated to test data, outperforming published EN-trained clocks.
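The initial ranking step can be sketched as follows, ordering probes by the magnitude of their Pearson correlation with age (names illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def rank_by_age_correlation(betas, ages):
    """Return CpG column indices sorted by decreasing |Pearson r| with age."""
    r = np.array([pearsonr(ages, betas[:, j])[0] for j in range(betas.shape[1])])
    return np.argsort(-np.abs(r))
```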

The biological relevance of the genes annotated to the CpGs used in our mFSS OLS model is known, and their inclusion in the model is due to their age-specific change in DNA methylation. An analysis of the roles of the top CpG-annotated genes in our mFSS model that overlap between the 450K datasets suggests that they mostly reflect aging-related changes, such as increases in body mass/fat, repeated viral infections and inflammation, pre-cancerous growths, alcohol detoxification, and changes in teeth/hair/skin (Supplementary Excel 2). Other notable roles include protein degradation, maturation of the extracellular matrix, and perception of pain.

Coherence theory of disease-related shifts for linear models

It has been suggested, without empirical verification by randomized blinded studies, that deviations between predicted and actual ages signify age acceleration [1, 3,4,5, 7,8,9,10,11,12], yet our data show that such deviations could also be due to random experimental variation and to the healthy dynamics of DNAm. Additionally, EN clocks do not reliably distinguish healthy people from patients diagnosed with diseases, including those of inflammaging ([23] and Fig. 2).

At the same time, as expected, there are some CpGs that reliably detect inflammaging-specific methylation; Supplementary Fig. 3a shows two such cases for rheumatoid arthritis. To resolve this conundrum, in which linear models poorly separate inflammaging from health even though DNAm changes with inflammaging are pronounced in the array data, we note that disease-specific shifts in DNAm can be either increases or decreases, and pairing those signal shifts with weights, which can be either positive or negative, in a linear combination may negate the disease signal present at individual CpGs. This implies that diseases do not inherently produce an upward shift in predicted age (or age acceleration) when methylation states are simply summed across CpGs in a linear model. It would be more robust, and would better reflect the biology, to pair positive shifts with positive weights and negative shifts with negative weights.

In linear models, the signs of the weights are determined by the solution which minimizes the cost function of the model and need not depend on the signs of the correlations between individual features and the response variable (i.e., a feature with a positive association with the response variable may still be assigned a negative weight in the multiple regression solution, and vice versa). Similarly, individual feature shifts can be either positive or negative, depending on the agreement in sign between the model weights and signal shifts (Box 2, Supplementary Fig. 3b). This can cause an artificial cancellation effect and can produce a net age shift near or equal to zero despite significant differences between the methylomes of inflammaging and healthy states. We define such a property of a linear model as model incoherence.
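A toy numerical illustration of incoherence and its rectification (all numbers are hypothetical):

```python
import numpy as np

# Two CpGs shift with disease by the same magnitude but in opposite directions.
weights = np.array([10.0, 10.0])         # hypothetical model weights (years per beta unit)
signal_shift = np.array([+0.05, -0.05])  # disease-related shifts in mean beta value
feature_shift = weights * signal_shift   # each CpG's contribution to the age shift
print(feature_shift.sum())               # 0.0 -> the disease signal cancels (incoherent)

# If the negatively shifted probe is rectified so its shift becomes positive,
# all feature shifts agree in sign and the signal adds up (coherent).
print((weights * np.abs(signal_shift)).sum())  # 1.0 years
```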

Box 2 Definitions of signal shift, age shift, feature shift, and feature rectification

Conversely, we define the property where all feature shifts agree in sign to produce a net positive age shift as model coherence. Due to the association of overprediction of age with age acceleration, we define positive feature shifts as coherent and negative feature shifts as de-coherent. The degree of incoherence in a model is expected to vary for different diseases and to depend on the interactions between the feature shifts for a given disease. Supplementary Fig. 3b summarizes the interactions between the signal shifts and model weights which produce coherent and de-coherent feature shifts.

Model coherence is a desirable property if researchers would like their DNAm age predictors not only to predict chronological age, but also to detect shifted trajectories of biological age. Here, we propose a method, which we term feature rectification, for inducing model coherence during model training by allowing only positive feature shifts. While the signs of feature weights for a model cannot be known a priori, it is possible to constrain a model to have non-negative weights, thereby eliminating de-coherent feature shifts wherever signal shifts are positive.
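A minimal sketch of feature rectification under these definitions. The reflection b → 1 − b for negatively shifted probes and the mean-difference definition of the signal shift are assumptions for this illustration (the exact transformations are in the methods and Box 2); non-negative weights can then be enforced directly, e.g., with scikit-learn’s LinearRegression(positive=True).

```python
import numpy as np

def signal_shift(betas_disease, betas_control):
    """Per-CpG signal shift, assumed here to be the difference in mean beta
    value between the disease and healthy-control cohorts."""
    return betas_disease.mean(axis=0) - betas_control.mean(axis=0)

def rectify_features(betas, shifts):
    """Feature rectification (sketch): reflect probes with negative disease
    shifts (b -> 1 - b, since beta values lie in [0, 1]) so that all signal
    shifts become positive before fitting a non-negative-weight model."""
    rectified = betas.copy()
    negative = shifts < 0
    rectified[:, negative] = 1.0 - rectified[:, negative]
    return rectified
```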

mFSS OLS for a coherent, generalizable age predictor and inflammaging detector

Following the success with the toy data, we tested our coherence theory for linear models by training an mFSS model with feature rectification on actual PBMC samples from people with rheumatoid arthritis (RA) and healthy controls. We calculated the RA signal shifts in the GSE42861 dataset (Supplementary Excel 3) for the top 10,000 age-correlated CpGs (with respect to the samples in GSE40279) and performed the reflection transformation for the negatively shifted probe signals in the RA test set and in the GSE40279 and GSE55763 training and validation datasets, respectively (methods). Next, we trained an mFSS OLS model constrained to have non-negative weights. The features were fed into the model ranked by the magnitude of their age correlation in the training set. The validation set MSE was optimized at the 680th feature, with 124 features having non-zero weights (selected features and weights and feature rectification information in Supplementary Excel 4; training and validation performance in Supplementary Fig. 4a).

We next obtained age predictions for the RA test set and plotted them against the actual ages. Figure 4a compares the incoherent, basic mFSS OLS model predictions and the mFSS OLS model trained on rectified features. The coherence is apparent for the latter model, with the RA cohort predictions visibly shifted upward relative to the healthy control (HC) cohort. We quantified the shift by performing a two-tailed Welch’s t-test on the distributions of the residuals of the HC and RA cohorts (p = 1.9e − 25) and calculating the effect size of the shift (d = 0.83) (Fig. 4b and c). Although the RA dataset was not used to train the model directly, the feature rectification was based on the signal shifts between the RA and HC cohorts in the GSE42861 dataset. To test whether the coherence effect is still present in completely out-of-sample data, we predicted the ages on several other datasets with different disease cohorts (multiple sclerosis, depression, Parkinson’s disease, and tauopathies) (Supplementary Fig. 4b-e). Each of the disease cohorts was slightly up-shifted relative to their respective healthy control cohorts, but with very small effect sizes ranging from d = 0.14 to d = 0.19.

Fig. 4

Feature-rectified coherent mFSS model detects inflammaging. a (Left) Scatterplot of ages predicted by the basic mFSS clock vs. actual ages for the GSE42861 dataset colored by disease state (healthy controls in blue; rheumatoid arthritis in red) compared to predictions of a coherent mFSS model trained on rectified features based on RA signal shifts (right). Correlation, p-value, and MAE calculated for the control cohort only. b Box plots of the residuals for healthy controls and rheumatoid arthritis patients corresponding to the basic mFSS model (left) and coherent mFSS model trained on rectified features (right). p-values determined by Welch’s t-test. Effect size (d) determined by Cohen’s d (methods). c Histograms and KDE plots, normalized by cohort size, of the residuals for healthy controls and rheumatoid arthritis patients corresponding to the basic mFSS model (left) and coherent mFSS model trained on rectified features (right). Vertical lines indicate the cohort means. d Scatter plot of predicted age vs. actual age for a combined dataset (GSE42861, GSE72774, GSE106648) colored by disease state (healthy controls in blue; inflammaging-associated disease in red), compared to predictions of a coherent mFSS model trained on rectified features based on inflammaging signal shifts (right). Correlation, p-value, and MAE were calculated for the control cohort only. e Box plots of the residuals corresponding to the basic mFSS model (left) and coherent mFSS model (IR-mFSS model) trained on rectified features (right) on the combined dataset. f Histograms and KDE plots, normalized by cohort size, of the residuals for the basic mFSS model (left) and coherent mFSS model (IR-mFSS model) trained on rectified features (right) on the combined dataset. Vertical lines indicate the cohort means

Interestingly, training on features ranked instead by the magnitude of the signal shifts between healthy controls and RA reduced the accuracy of predictions of chronological age in all datasets and decreased calibration, while increasing the resolution between health and inflammaging for all but one disease (Parkinson’s) (Supplementary Fig. 5b-f). The MSE for the validation set was optimized at the 1656th feature with 47 features having non-zero weights (Supplementary Fig. 5a). This is expected, because to linearly predict age with the lowest error, the weights of the CpGs that change the most with inflammaging between people of the same age will be minimized via the least squares constraint.

Next, to make a more generalized inflammaging-coherent model, we trained an mFSS OLS model with feature rectification based on a combined dataset with inflammaging-associated disease cohorts (GSE42861, GSE106648, GSE72774), which we dubbed the inflammaging-rectified mFSS (IR-mFSS) model. The model was again trained on GSE40279 and validated on GSE55763, with the top 10,000 age-correlated CpGs (per the GSE40279 samples) ranked by absolute signal shift magnitude (Supplementary Excel 5) for the combined inflammaging dataset (selected features and weights and feature rectification information in Supplementary Excel 6; training and validation performance in Supplementary Fig. 6a). Figure 4d compares the incoherent basic mFSS OLS model predictions on the combined inflammaging dataset with those of the coherent IR-mFSS model. We quantified the shift by performing a two-tailed Welch’s t-test on the distributions of the residuals of the HC and inflammaging cohorts (p = 6.1e − 35) and calculating the effect size, which was medium-to-large (d = 0.65), as seen in Fig. 4e and f. Supplementary Fig. 6b-f show the distributions of the residuals for different datasets, comparing patients with inflammaging-associated diseases and healthy controls. The IR-mFSS clock substantially increased the resolution over the basic mFSS clock for all diseases, with effect sizes ranging from small (d = 0.27) to very large (d = 1.3).

The genes annotated to the 47 CpGs selected by the IR-mFSS clock and their functions are provided in Supplementary Excel 7.

While EN regression is not essential for creating an accurate and generalizable age predictor, an interesting avenue for future research would be exploring how EN would assign weights if trained only on data over the feature space of the IR-mFSS CpG selection. It would also be interesting to see the influence of EN on the interpretability and disease detection capabilities of our model’s feature selection.
