Refining epigenetic prediction of chronological and biological age

Data overview

Generation Scotland is a Scottish family-based study with over 24,000 participants recruited between 2006 and 2011 [19]. Blood-based DNAm levels at 752,722 CpG sites were quantified using the Illumina MethylationEPIC array for 18,413 individuals. Participants were aged between 18 and 99 years at recruitment, with a mean age of 47.5 years (SD 14.9, Table 1). The data was processed in three sets (NSet1 = 5087, NSet2 = 4450, NSet3 = 8876), with a total of 121 experimental batches (see Additional file 1).

Table 1 Age profile and test set prediction performance for cohorts used in cAge predictor training and testing. Predictions were made using a LOCO approach, where each cohort was excluded in training and the resulting model was used for testing (see Methods). Models were trained on age, and if an individual was predicted to be under 20, their prediction was re-estimated considering models trained on log(age). External cohort information taken from Zhang et al. [5]. r column states Pearson correlation, RMSE the root mean squared error, and MAE the median absolute error

In order to train and test a cAge predictor, Generation Scotland data as well as that from an additional 6261 individuals from ten external cohorts were considered. These included the Lothian Birth Cohorts (LBC) of 1921 and 1936 [20, 21] and eight publicly available Gene Expression Omnibus (GEO) datasets (Table 1) [4, 22,23,24,25]. In addition, the independent dataset GEO GSE55763 [13, 26] (2711 samples from 2664 individuals) was used to assess cAge clock performance against existing clocks in individuals not used for training across any of the predictors considered. Given that the external datasets assessed DNAm (blood-based apart from GSE78874, which considered saliva) using the Illumina HumanMethylation450K array, the Generation Scotland data were subset to 374,791 CpGs that were present across all studies. Missing values were mean imputed per CpG and per cohort.

The bAge predictor was trained using data for 18,365 participants from the Generation Scotland cohort for which valid death status data (i.e. death status non-missing, and age at death not lower than age at baseline) via linkage to the National Health Service Central Register was available. A total of 1214 participant deaths have been recorded as of March 2022, when records were last updated. Alive individuals in March 2022 were censored at their age at that time (TTE thus being age in March 2022 minus age at baseline). Average TTE amongst deaths was 7.79 (SD 3.54) years, and average TTE amongst censored samples was 12.82 (SD 1.35) years. To test the bAge predictor, data from an additional 4134 individuals (with a total of 1653 deaths) from four external cohorts (six datasets) were considered. These included the baseline samples of both the LBC1921 and LBC1936 cohorts, as well as the Framingham Heart Study (FHS) [27,28,29] and the Women’s Health Initiative (WHI) [30, 31] Broad Agency Award 23 (B23) study for Black, White, and Hispanic individuals (Table 2).

Table 2 Cox proportional hazards output for GrimAgeAccel and bAgeAccel in the test datasets. Hazard ratios are presented per standard deviation of the GrimAgeAccel and bAgeAccel variables. Further details in Additional File 4: Table S11. Asterisk symbol (*) indicates the following: the FHS cohort used here was the same as the test set from the original GrimAge paper

A detailed description of the datasets used (Generation Scotland, GEO, LBC, FHS, and WHI) can be found in Additional file 1.

Epigenome-wide association study of chronological age

We conducted an EWAS to identify CpG sites that had linear or quadratic associations with chronological age, using Generation Scotland data (N = 18,413, CpGs = 752,722). Linear regression analyses were carried out which included both linear and quadratic CpG M-values as independent variables and age as the dependent variable (Age ~ CpG and Age ~ CpG + CpG2, respectively). Fixed effect covariates included estimated white blood cell (WBC) proportions (basophils, eosinophils, natural killer cells, monocytes, CD4T, and CD8T cells) calculated in the minfi R package (version 1.36.0) [32] using the Houseman method [33], sex, DNAm batch/set, smoking status (a factor with 5 levels: current, gave up in the last year, gave up more than a year ago, never, or unknown), smoking pack years (number of packs of cigarettes smoked per day, 20 cigarettes per pack, multiplied by the number of years the person smoked), and 20 DNAm principal components (PCs) to correct for unmeasured confounders. Family structure was not accounted for given the nature of the phenotype. Age was centred by its mean, and CpG/CpG2 M-values were scaled to mean zero and variance one. Epigenome-wide significance was set at p-value < 3.6 × 10−8, as per Saffari et al. [34]. For each CpG-age association, F-tests were used to compare models including the CpG as a linear term, versus one including both linear and quadratic terms, whilst controlling for all covariates listed here.

Epigenome-wide association study of time to all-cause mortality

We conducted an EWAS to identify CpG sites (from a total of 752,722 loci) that were associated with time to all-cause mortality in Generation Scotland. Cox proportional hazards (Cox PH) regression models were fit for each CpG site as the predictor of interest using the coxph function from the survival R package (version 3.3.1), with time to all-cause mortality or censoring as the survival outcome. Fixed effect covariates included those used in the cAge EWAS (age at baseline, sex, batch/set, smoking status, smoking pack years, WBC estimates, and top 20 DNAm PCs). Epigenome-wide significance was set at p-value < 3.6 × 10−8.

To assess whether relatedness in the cohort influenced the results, we fit a Cox PH model with a kinship matrix for each significantly associated CpG, using the coxme R package (version 2.2.16).

Prediction of chronological age

We used elastic net regression to derive a predictor of chronological age from the 374,791 CpG sites common across all cohorts considered in cAge training (description of cohorts in Table 1). The L1, L2 mixing parameter was set at α = 0.5 based on epigenetic clock precedent [3, 5]. The biglasso R package (version 1.5.1) was used [35], with 25-fold cross-validation (CV; ~ 1000 individuals per fold) to select the shrinkage parameter (λ) that minimised the mean cross-validated prediction error. A sensitivity analysis was performed, assigning individuals from the same methylation set, batch, and cohort to individual folds, which returned highly similar results.

The effect of including external cohorts in training, as well as accounting for non-linear relationships and pre-selection of features, amongst others, is briefly detailed in Additional file 2. As a result of these analyses, we created a predictor making use of a LOCO framework, training on both log(age) and age, and performing feature pre-selection ahead of elastic net. Here, we describe each of these steps.

Leave-one-cohort-out

cAge predictors were created using a LOCO framework where, for each of the 10 external cohorts, a model was trained in Generation Scotland and all but one of the external cohorts (Fig. 2). We then tested each of the 10 trained models on the excluded cohort. A final model was trained using all 11 datasets. Pearson correlations (r) of cAge predictions with reported age were calculated along with the root mean square error (RMSE) and median absolute error (MAE).

Fig. 2figure 2

Flowchart for the creation of the cAge predictor. First, DNAm data originating from Generation Scotland and 10 external datasets was pre-processed. Next, CpGs were pre-selected based on the Generation Scotland EWAS for epigenome-wide significant linear and quadratic CpG-age associations. Elastic net models were then trained and tested on the remaining features using a LOCO framework with 25-fold CV, with training on both age and log(age) as outcomes

Log(age)

In addition to training on chronological age, we also trained models on the natural logarithm of chronological age, log(age), using the same LOCO framework as described above. The age of our test samples was predicted using the model trained on chronological age and, if the value returned was 20 years or younger, a new prediction was obtained making use of the model trained on log(age).

Feature pre-selection

Several studies have highlighted the benefits of feature pre-selection for elastic net [36, 37]. Here, we performed preliminary analyses, including differently sized subsets of CpG sites as features in elastic net. After filtering for CpGs present across all datasets (374,791), we considered sites that were epigenome-wide significant at p < 3.6 × 10−8 and then ranked CpGs in ascending order of p-value (most significant ranked first), before defining subsets of varying sizes (from 1000 to 300,000 CpGs). For the purpose of selecting an optimal number of pre-selected CpGs, we performed a screening using Generation Scotland as our training cohort and GSE40279 (one of the largest external datasets with a wide age range) as our test set. Our analyses showed that the 10,000 most significant loci (age—CpG associations) yielded the test set predictions with the highest r and lowest RMSE (see Results). In addition to these sites, subsets of CpGs with a significant quadratic relationship to age were explored, with subset sizes varying from 100 to 20,000. These features were included in training as CpG2 beta values and, when not already present in the model, in their linear form as well. In addition to the top 10,000 age-associated CpGs, the top 300 quadratic sites from our EWAS yielded the best performing model (see Results). This final list of features was then used as input for the LOCO framework described above. The final models, trained on all datasets, selected a λ of 0.0308 for the model trained on age and a λ of 0.0006 for the model trained on log(age).

Comparison to ZhangAge, HannumAge, and HorvathAge

Our final cAge predictor (trained on all 11 datasets in Table 1) and those by Zhang et al. (ZhangAge) [5], Hannum et al. (HannumAge) [4], and Horvath (HorvathAge) [3] were projected onto the GSE55763 dataset to compare their performance in an independent test set. External clock predictions were calculated using the methylCIPHER R package [38] (https://github.com/MorganLevineLab/methylCIPHER).

Prediction of time to all-cause mortality as a proxy for biological ageTraining in Generation Scotland

To train a bAge predictor, component scores for GrimAge were estimated for all Generation Scotland samples via Horvath’s online calculator [17] (http://dnamage.genetics.ucla.edu/new). These included EpiScores of smoking and seven proteins—DNAm ADM, DNAm B2M, DNAm cystatin C, DNAm GDF15, DNAm leptin, DNAm PAI1, and DNAm TIMP1. Each variable was then standardised to have a mean of zero and variance of one. We also considered DNAm EpiScores for 109 proteins as described by Gadd et al. [14]. The 109 EpiScores were projected into Generation Scotland via the MethylDetectR [39] Shiny App (https://shiny.igmm.ed.ac.uk/MethylDetectR/) before being standardised to have a mean of zero and variance of one.

This resulted in 116 protein EpiScores, a smoking EpiScore, plus chronological age and sex as features for an elastic net Cox PH model (R package glmnet version 4.1.4), using time to all-cause mortality or censoring as outcome. A 20-fold CV was performed (with approximately 1000 individuals per fold), with individuals from the same Generation Scotland technical batch (see Additional file 1) included in the same fold, and with Harrell’s C index used to identify the optimal λ value (0.0025).

Testing in LBC, FHS, and WHI

We defined bAge as the weighted linear combination of covariates selected by our Cox PH elastic net model (see Results). These estimates were then scaled and returned as a predictor with mean of zero and variance of one, for each dataset. A bAgeAccel estimate was also calculated, which is the residual of bAge regressed on chronological age to obtain measure of accelerated epigenetic ageing.

After regressing on age, we assessed the association between our bAge clock, as well as GrimAge, PhenoAge, and DunedinPACE, and time to all-cause mortality in LBC1921 and LBC1936. GrimAge and PhenoAge were calculated using Horvath’s online calculator [17], whilst DunedinPACE was calculated via the DunedinPACE R package [16] (https://github.com/danbelsky/DunedinPACE). Cox PH models, adjusting for age and sex, were used to evaluate associations between the clocks and all-cause mortality. Further, Cox PH models treating GrimAge, PhenoAge, and DunedinPACE (in turn) as a covariate in addition to our bAge clock were run to assess our predictor’s independent association with mortality.

Finally, associations with time to all-cause mortality in four additional external datasets (FHS, and the WHI studies for White, Black, and Hispanic ancestries) were assessed for GrimAge and bAge, the clocks with the largest associations in the LBC cohorts (Table 2).

We examined Schoenfeld residuals in the LBC1921 and LBC1936 Cox PH models that included age, sex, and our bAge clock as covariates to check the proportional hazards assumption at both global and variable-specific levels using the cox.zph function from the R survival package (version 3.3.1).

CpG-based predictor of mortality

We also investigated a direct CpG predictor for time to all-cause mortality (methods and results described in Additional file 2). This predictor was found to have weaker associations with time to all-cause mortality in the LBC cohorts than the aforementioned bAge estimate, both when training just on CpGs as well as when considering both CpGs and EpiScores as training features.

Enrichment analyses

Gene set enrichment analyses were performed using the Functional Mapping and Annotation (FUMA) GENE2FUNC tool [40], which employs a hypergeometric test. Background genes employed included all unique genes tagged by CpGs in the EPIC array. A false discovery rate (FDR) p-value threshold was set at 0.05, and the minimum number of overlapping genes within gene sets was set to 2. These analyses did not explicitly account for Illumina chip biases relating to how CpGs are annotated to genes [41], which may have influenced our results.

留言 (0)

沒有登入
gif