The effect of family structure on the still-missing heritability and genomic prediction accuracy of type 2 diabetes

Study subjects

Individuals participating in the Tehran Lipid and Glucose Study (TLGS), the ongoing periodic cohort study of the Iranian population project, were included in this study. The TCGS population consists of individuals from diverse ethnic backgrounds, providing valuable insights into the diversity of the Iranian population. This ethnic information was gathered through self-reported data and questionnaires detailing the birthplaces of the past three generations [22]. Epidemiological data on non-communicable disorders’ risk factors has been collected from 15,000 participants of TLGS every three years for the past 25 years. All participants in the cohort study executed written informed consent prior to inclusion. TLGS participants were recruited in six phases between October 1, 1999, and April 1, 2018, with approximately three years between each phase. The first phase had 15,005 participants, and the second phase had 3,531 new participants [22, 23]. Here, 14,113 individuals aged > 20 years were selected from the dataset, of which 2,284 and 11 individuals with missing data on diabetes status and body mass index (BMI) were excluded, respectively. We meticulously excluded cases that might have been misclassified as type 2 diabetes. This involved excluding not only patients with Type 1 diabetes and congenital diabetes but also those with monogenic diabetes, particularly Maturity-Onset Diabetes of the Young (MODY) [24]. Therefore, 11,818 individuals, with an age of 45.7 ± 16.8 years, entered the study.

We used the TLGS standard questionnaire to collect demographics, medical, and drug history information. Weight was measured in kilograms and height in meters; subsequently, the BMI of participants was calculated using the formula of $\:BMI=\frac^}$. Moreover, following a 12–14 h overnight fast, blood samples were taken from all study participants to quantify biochemical parameters, including fasting plasma glucose (FPG) and 2-hour postprandial plasma glucose (2hpp). T2D was defined as if one of the following conditions were present: (i) treatment with antidiabetic drugs at least once in 6 phases, (ii) FPG was more than 126 mg/dL, or (iii) 2hpp was more than 200 mg/dL.

Genotyping, quality control, and imputation

All individuals were genotyped with HumanOmniExpress-24-v1-0 bead chip (Illumina, San Diego, CA) at deCODE genetic company. This bead chip provided 649,932 SNPs with an average mean distance of 4 Kb for each individual, as MS Daneshpour, et al. [25] described. To find any problem with recorded relationship information, a pedigree check was conducted using S.A.G.E (Statistical Analysis for Genetic Epidemiology) software v6.4 [26]. To perform a parentage check, snp1101 software v1.0 [27] was used to find contradictory information based on recorded parental and genotype platforms’ information [28]. This software checks for Mendelian inconsistencies and calculates the probability of correct parentage assignment. A conservative probability threshold of approximately 0.95 was used to ensure strong confidence in the detected parentage relationships. Any parent-offspring pairs with probabilities below this threshold were flagged for further investigation. We encountered inconsistencies in the parental information for a total of 132 individuals. Specifically, these inconsistencies pertained to the non-biological parent. To address this, we designated these 132 individuals as having unknown parentage with respect to the non-biological parent within the pedigree structure.

Quality control of individuals and genotypes was performed using PLINK software v1.9 [29]. We initiated the quality control (QC) process for both individuals and markers using PLINK software. Initially, we filtered out SNPs and individuals with a missing rate exceeding 20% to eliminate low-quality data, resulting in the removal of 770 SNPs and 11 individuals. This initial step was conducted using a non-stringent threshold. Subsequently, we applied a more stringent criterion by setting a threshold of 2% to exclude SNPs and individuals with more than 2% missing data. This step led to the removal of 17,636 SNPs, but no additional individuals were excluded. In the next phase, we examined sex discrepancies by comparing the recorded sex with genetic data derived from the X chromosome; no discrepancies were observed. To maintain the power of the study, we excluded SNPs with a minor allele frequency (MAF) of less than 0.05, which resulted in the exclusion of 72,500 SNPs. Additionally, markers that significantly deviated from Hardy-Weinberg equilibrium (HWE) were excluded using a p-value threshold of 1e-6, leading to the removal of 1,125 SNP markers. We also removed individuals whose heterozygosity rates deviated by more than ± 3 standard deviations from the mean, which led to the exclusion of 317 individuals. Population stratification was also checked using principal component analysis (PCA) using R software’s SNPRelate package [30]. Finally, the missing genotypes were imputed using Beagle software v5.4 [31].

Statistical analysis

The design of this study is shown in Fig. 1.

Fig. 1

Flowchart of population selection based on different scenarios. The overall design of this study was based on estimating of heritability based on models using A or G matrix computed using genealogical information or genome-wide common SNPs, respectively

Familial structure scenarios

Three different scenarios were used to investigate the effect of familial structure on still-missing heritability and genomic prediction accuracy (Fig. 1). In the first scenario, all families were used without any limitation (scenario i). In this scenario, of 11,818, we had 1,967 persons within 1,591 families with zero generation and respectively zero and non-zero pedigree- and genomic-based relatedness with the others. We evaluated the effect of these individuals on estimated heritability from pedigree and genome-wide markers in scenario ii by removing them from the data and comparing the results with the scenario i. This results in 9,851 individuals within 2,189 families with ≥ 1 generation. Of all families represented in the scenario ii, we had 1,091 families (3,959 individuals) that only have one disease status, case or control. We hypothesized that presenting families with only one disease status may induce overestimation of heritability in both pedigree- and genomic-based methods. This hypothesis is based on the understanding that heritability is a statistical measure quantifying the proportion of variation in a trait within a population attributable to genetic differences, ranging from 0 to 1. A value of 1 indicates that all observed variation in the trait is due to genetic factors, while a value of 0 suggests that the variation is entirely due to environmental factors. When all members of a family exhibit a particular disease, it implies that the disease is predominantly influenced by genetic factors within that family, resulting in heritability approaching 1. Conversely, if none of the family members have the disease, the heritability would still approach 1, indicating that genetic factors are the primary contributors to the absence of the disease within the family. This phenomenon can lead to an overestimation of heritability especially in pedigree-based estimations, as this method assumes an additive genetic relationship of zero between families, while SNP-based estimations consider non-zero genetic relationships between individuals across families. This hypothesis was tested by constructing scenario iii through removing families with only diabetic (case) or non-diabetic (control) individuals from scenario ii and resulting in 1098 families with both disease status represented within them.

Still-missing heritability estimation

To estimate heritability based on genealogical information ($\:^$), a generalized linear model framework was implemented. Each element of the response vector $\:\varvec=_\}$ had two possible values, i.e., presence $\:_=1$ or absence $\:_=0$ of T2D for the ith individual. We used a probit link function $\:\text\left(_=1|}_\right)=\left(_\right)$, where $\:\varvec$ and Φ are a vector of regressors and the standard normal cumulative distribution function, respectively, and $\:_$ is a linear predictor given by:

$$\:}_}=+\sum\:_^}}_\text}}_}+\:_}$$

(1)

where µ is an intercept, $\:_$ is the covariate of the $\:i$th individual at the $\:k$th effect (i.e., sex, age, and BMI), $\:_$ is the $\:k$th fixed effect, and $\:_$ is the random genetic effect of the $\:i$th individual. In this model we assumed a latent normally distributed variable $\:_=_+_$, where $\:_$’s are independent residual terms that follow the standard normal distribution, and a measurement model $\:_=0$ if $\:_<\gamma\:$, and 1 otherwise, where γ is a threshold parameter, and $\:_$ is an independent normal model residual with mean zero and variance set equal to one. In this equation, the vector of random additive genetic effect, $\:\mathbf=\left\_\right\}$ follows a multivariate normal distribution of $\:\mathbf\sim\text(0,\mathbf}_}^)$, where $\:\mathbf$ is the covariance matrix computed using pedigree information and $\:}_}^$ is an additive genetic variance caused by sample relatedness.

In order to estimate heritability based on genome-wide common SNPs ($\:}_\text\text}^$), we used a linear predictor given by:

$$\:}_}=+\sum\:_^}}_\text}}_}+\:}_}$$

(2)

which is comparable to model (1); however, $\:_$ is the random genomic effect which is assumed to have a distribution of $\:\mathbf\sim\text(0,\mathbf}_}^)$, where $\:\mathbf$ is the genomic relationship matrix constructed based on the method proposed by J Yang, et al. [32] and $\:}_}^$ is the genomic variance.

All models were fitted using the Bayesian approach implemented in the R package BGLR [33]. All Bayesian inference was conducted using a Markov chain Monte Carlo (MCMC) approach, Gibbs sampling. To estimate heritability, we ran 350,000 iterations of a Gibbs sampler, where the first 100,000 samples were discarded as burn-in, and the remaining samples were thinned at a thinning interval of 50. Thus, 5,000 posterior samples were used to infer the posterior distribution features. We conducted convergence diagnosis of the Gibbs sampler using a criterion of the accuracy of estimation of a quantile using the R package coda [34].

The $\:}^$ and $\:}_\text\text}^$ were then estimated by$\:}^=\frac}_}^}}_}^+}_}^}$ and $\:}_\text\text}^=\frac}_}^}}_}^+}_}^}$, and the still-missing heritability was calculated by subtracting $\:}_\text\text}^$ from $\:}^$.

T2D risk prediction accuracy using ERV and GERV

T2D risk prediction was performed using three models: (i) model (1) above, (ii) model (2) above, and (iii) a fixed model as control, which is comparable to model (1) above, but without the random genetic effect term $\:\mathbf$.

A 10-repeated 5-fold cross-validation (CV) was conducted to evaluate the prediction models’ performance in each scenario. In each repeat, we randomly divided individuals into 5 subsamples. In each fold of the CV, each subsample was considered the testing set and others the training set. The process followed until every 5 subsets were placed in the testing set precisely once. The entire process was repeated 10 times to reduce the variance of the estimated prediction accuracy. In each CV, for each model, the number of iterations of the Gibbs sampler was 210,000, with the first 60,000 samples discarded as burn-in. A thinning interval of 30 was used.

A receiver operating characteristic (ROC) analysis was used to compare the implemented models under different scenarios based on the area under curve (AUC), sensitivity, and specificity.

Construction and validation of T2D PRS

In the data, we calculated the PRS based on the summary statistics of a previous multi-ancestry GWAS conducted by A Mahajan, et al. [35], which is publicly available on the Diabetes Meta-Analysis of Trans-Ethnic association studies (DIAMANTE) Consortium data download website (http://diagram-consortium.org/downloads.html). Participants in the present study were independent of the DIAMANTE Consortium participants. From the summary statistics data, we removed SNPs with low info-score < 0.8, SNPs with MAF < 0.01, ambiguous SNPs, and SNPs on sex chromosomes.

PRSice software v2.3.3 [36] was used for PRS calculation using the commonly implemented method, LD clumping and p-value thresholding (C + T). As a default setting of the software, we adopted the additive model for PRS calculation, which sums up the dosage of the effective allele carried by an individual (ranging from 0 to 2). PRS was calculated for each individual using the following formulae:

where $\:_$ is the effective allele of the $\:j$th SNP, which derived from DIAMANTE Consortium GWAS, as the base data, $\:_$ is the number of the effective allele of the $\:j$th SNP for the $\:i$th individual, $\:_$ is the number of alleles included in the PRS for the $\:i$th individual, and $\:n$ is the number of SNPs.

To implement C + T, LD clumping was first performed using pairwise LD of $\:}^$ < 0.25 within 200 kb windows. Then, we calculated the PRS in the target data (n = 11,818) using different variant sets based on p-value thresholds between 1 and $\:5.0\times\:^$, Model 1 to Model 10, including age, sex, and the first 10 principal components (PCs), to account for population stratification (Table 1). The AUC metric was implemented to measure the capability of the model in discriminating between those having and not having T2D. The value of AUC closer to 1 is good in terms of discriminative performance, and the value of AUC closer to 0.5 means the model’s performance is like random guessing. In addition to AUC, Nagelkerke’s R² was used to establish the part of the variance in the dependent variable that the independent variable in each model could account for. This metric gives us an understanding of the models. The better the value, the more the fit toward the data. The logistic regression models have been compared through these metrics and labelled from model 1 to model 2. Among the models, Model 2 showed the highest performance, with respect to AUC and Nagelkerke’s R² value being the largest and substantially higher than the rest (p < 0.001, #variants = 3026).

Table 1 Performance metrics of logistic regression models (model 1 to Model 10) based on AUC and Nagelkerke’s R². The table highlights each model’s discriminative ability and explanatory power, with PT7 demonstrating superior performance across both metrics

To compare the prediction ability of PRS, ERV, and GERC, we calculated Spearman’s correlation between estimated ERVs or GERVs and predicted PRS on the target under the three different scenarios for family structure.

Simulation study

A simulation analysis was performed based on the TLGS dataset to quantify the difference between estimates of $\:}^$ and $\:}_\text\text}^$ in the three familial structure scenarios. We randomly selected 250 SNPs from the genotype profile of the population and considered them as quantitative trait loci, assigning reference alleles as the effect alleles. The quantitative phenotype of each individual ($\:_$, for the ith individual) with a low ($\:0.1$), moderate ($\:0.3$), and high ($\:0.5$) heritability ($\:_^$) was simulated under an additive model by summing the effect sizes of all effect alleles using:

where $\:_$ is the number of effect alleles carried by the ith individual at the jth randomly selected SNP, $\:_$ is the simulated effect of the jth SNP, and $\:_\sim\:N\left(_\right|0,_^)$ are i.i.d. standard normal residuals, where $\:_^$ was set to ($\:1-_^$) to ensure the desired level of heritability of the trait. Different studies have identified various numbers of SNPs (up to more than 400) significantly associated with T2D [35, 37, 38]. Based on the genetic structure of the disease, we assumed a total of 250 SNPs with nonzero effect. Among the candidate genes for T2D, a small number of them were highly replicated in T2D association studies [39]. Moreover, evidence from additional studies, such as the European and East Asian T2D GWAS meta-analyses [40, 41], suggests that specific SNPs located near critical genes associated with T2D demonstrate notably high effect sizes. This made us decide to introduce a small number of SNPs as major quantitative trait loci. Consequently, we assumed 70% of the additive genetic variance was explained by 247 SNPs with a relatively small effect, and the remaining 30% was explained by 3 SNPs with a sizable effect. The simulated $\:_$ was sorted from big to small and the top $\:_$ converted to T2D status, with the same prevalence as observed in our TLGS population. All simulation processes were repeated ten times to obtain a robust estimate of the heritability.

View original article

HUMAN GENOMICS

分享书签

0 0 0 0 0 0 0

More from this channel

The effect of family structure on the still-missing heritability and genomic prediction accuracy of type 2 diabetes

留言 (0)