Identifying latent genetic interactions in genome-wide association studies using multiple traits

Motivation

Consider the trait \(Y_\) for \(j = 1,2,\ldots , n\) unrelated individuals with \(k=1,2, \ldots , r\) measurable traits. Suppose \(Y_\) depends on a biallelic locus with genotype \(X_j\) denoting the number of minor alleles for the jth individual, an unobserved (or latent) variable \(M_j\), and a latent interaction \(X_j M_j\). The latent variable \(M_j\) can represent any type of interacting partner: a genotype at another locus, sex, age, or an environmental factor. For clarity, we will discuss \(M_j\) as an environmental factor and so the latent interaction is a genotype-by-environment (GxE) interaction.

We assume that these components contribute to trait expression additively in the following regression model:

$$\begin Y_ = \beta _X_ + \phi _ M_ + \gamma _ X_M_ + \epsilon _, \end$$

(1)

where \(\beta _\) is the effect size of the minor allele, \(\phi _\) is the effect size of the environmental variable, \(\gamma _\) is the effect size of the GxE interaction, and \(\epsilon _\) is an independent and identically distributed random error with mean zero and variance \(\sigma _^\). In this simplified setting, our goal is to detect the latent GxE interaction without observing the interacting variable \(M_\).

Under the above model assumptions, the latent GxE interaction will induce differential trait variance and covariance patterns that differ by genotype. Without loss of generality, assume the environmental variable has mean zero with unit variance. In Supplementary methods (Additional file 1), we show that the individual-specific trait variance (ITV) of the kth trait conditional on genotype is

$$\begin \text\left[ Y_ | X_j\right] = a_ + b_ X_j + c_ X_^, \end$$

(2)

where \(a_ = \phi ^_ + \sigma _^\), \(b_ = 2 \phi _ \gamma _\), and \(c_ = \gamma _^\). We also show that the individual-specific covariance (ITC) between the kth and \(k^\)th trait conditional on genotype is

$$\begin \text\left[ Y_, Y_} | X_\right] = \widetilde_} + \widetilde_} X_j + \widetilde_}X_^, \end$$

(3)

where \(\widetilde_} = \phi _ \phi _}\), \(\widetilde_} = \phi _\gamma _} + \phi _}\gamma _\), and \(\widetilde_} = \gamma_\gamma _}\). It is evident that a latent GxE interaction in trait k (\(\gamma _ \ne 0\)) not only induces a variance pattern that depends on genotype (Eq. 2; vQTL) but can also induce a covariance pattern between traits k and \(k^\) from either a shared interaction (\(\gamma _} \ne 0\)) or a shared environment involved in the interaction (\(\phi _} \ne 0\); Eq. 3; covQTL). These results suggest that we can test for loci with latent interactive effects by assessing whether the individual-specific trait variances (ITV) and covariances (ITC) differ by genotype without specifying or directly modeling the interacting variable \(M_\).

Latent Interaction Testing (LIT) framework

Our strategy builds from the above observations and estimates the ITV and ITC to detect non-additive biological signals that can be induced by latent genetic interactions. To derive estimates of these quantities, we first remove the additive genetic effect from the traits to ensure that any variance and covariance effects are not due to the additive effect. Let us denote the trait residuals as \(e_ = Y_ - \beta _X_\) where we assume the effect size is known for simplicity. We can then express the ITV and ITC as a function of these residuals: the ITV of trait k and the ITC between traits k and \(k^\) is defined as \(\text\mathit\vert X_j}\right]}\mathit=\textit\mathit^2\vert X_j}\right]}\) and \(\text\left[ Y_, Y_} | X_\right] = \text\left[ e_ e_} | X_\right]\), respectively (Additional file 1). Thus, we can estimate the ITV by squaring the residuals, \(e^_\), and estimate the ITC between traits k and \(k^\) by the pairwise product of the residuals (i.e., the cross products), \(e_ e_}\). Aggregating the ITV and ITC estimates across all individuals, we denote the cross product (CP) terms in the \(n\times s\) matrix \(\varvec^}\) where the jth row vector is \(\varvec^}_ = \left[e_e_, e_e_, \ldots ,e_ e_\right]\), and the squared residual (SQ) terms in the \(n\times r\) matrix \(\varvec^}\) where \(Z^}_ = e^_\).

Our inference goal is to assess whether the SNP, \(\varvec_ = [X_, X_, \ldots , X_] ^\), is independent of the squared residuals and cross products,

$$\begin \varvec^}_ \!\perp\!\!\!\perp \varvec &\quad \text\ \ \ q = 1, 2,\ldots , s,\ \ \ \text\\ \varvec^}_ \!\perp\!\!\!\perp \varvec&\quad \text\ \ \ k = 1, 2,\ldots , r, \end$$

(4)

where “\(\cdot\)” denotes all the rows (or individuals) and “\(\perp \!\!\!\!\perp\)” denotes statistical independence. In the above regression model, this corresponds to testing the global null hypothesis \(H_ : \gamma _ = \gamma _ = \ldots = \gamma _ = 0\) versus the alternative hypothesis \(H_ : \gamma _ \ne 0\) for at least one of the \(k = 1,2,\ldots , r\) traits. While a regression model can be directly applied to the squared residuals and cross products to test the global null hypothesis (see Additional file 1 for mathematical details), a univariate model approach does not adequately leverage pleiotropy and requires a multiple testing correction which reduces power.

To address these issues, we develop a new multivariate kernel-based framework, Latent Interaction Testing (LIT), that captures pleiotropy across the ITV and ITC terms to increase power for detecting latent interactions. There are three key steps in the LIT framework (Fig. 1):

1.

Regress out the additive genetic effects and any other covariates from the traits. Additionally, adjust the traits and genotypes for population structure.

2.

Calculate estimates of the ITV and ITC for each individual using the squared residuals and the cross products of the residuals, respectively.

3.

Test the global null hypothesis of no latent interaction by comparing the adjusted genotype(s) to the ITV and ITC estimates.

Fig. 1figure 1

Overview of the Latent Interaction Testing (LIT) framework. Given a set of r traits, \(\varvec\), and \(m_\) SNPs, \(\varvec\), the goal is to detect a latent genetic interaction involving the SNPs. The trait squared residuals (SQ) and cross products (CP), \(\varvec\), are calculated while adjusting for linear effects from the genotypes and any other covariates. The traits and genotypes are also adjusted for population structure. A similarity matrix for the genotypes, \(\varvec_}\), and the SQ and CP, \(\varvec_\), are calculated to construct a test statistic, T, which measures the overlap between the two matrices. Large values of T are evidence of a latent genetic interaction

We expand on the above steps in detail below.

Step 1: In the first step, LIT standardizes the traits and then regresses out the additive genetic effects, population structure, and any other covariates. This ensures that any differential variance and/or covariance patterns are not due to additive genetic effects or population structure. Suppose there are \(l_\) measured covariates and \(l_\) principal components to control for structure. We denote these \(l = l_+ l_\) variables in the \(n \times l\) matrix \(\varvec\). After regressing out these variables and the additive genetic effects, the \(n \times r\) matrix of residuals is \(\varvec=\widetilde} - \varvec\widehat} - \varvec\widehat}\), where \(\widetilde}\) is the standardized trait matrix, \(\widehat}\) is a \(1\times r\) matrix of effect sizes, and \(\widehat}\) is a \(l\times r\) matrix of coefficients estimated using least squares. We also regress out population structure from the genotypes which we denote by \(\widetilde}\).

The above approach only removes the mean effects and does not correct for variance effects from population structure which can impact type I error rate control [32]. A strategy to adjust for the variance effects is to standardize the genotypes with the estimated individual-specific allele frequencies (IAF), i.e., the allele frequencies given the genetic ancestry of an individual. However, it is computationally costly to standardize the genotypes for biobank-sized datasets as it requires estimating the IAFs of all SNPs using a generalized linear model [33, 34]. Therefore, in this work, we remove the mean effects from structure and then adjust the test statistics with the genomic inflation factor to be conservative. Our software includes an implementation to standardize the genotypes using the IAFs for smaller datasets.

Step 2: The second step uses the residuals, \(\varvec\), to reveal any non-additive biological signals by constructing estimates of the ITV and ITC. For the jth individual’s set of trait residuals, the ITVs are estimated by squaring the trait residuals while the ITCs are estimated by calculating the cross products of the trait residuals. We express the squared residuals as \(\varvec^}_ = \left[ e^_, e^_, \ldots , e^_\right]\), and the \(s = \) pairwise cross products as \(\varvec^}_ = \left[e_ e_, e_ e_, \ldots , e_ e_\right]\). Importantly, when the studentized residuals are used, then \(\varvec^}_\) and \(\varvec^}_\) represent an unbiased estimate of the ITVs and ITCs, respectively. We aggregate these terms across all individuals into the \(n \times (r+s)\) matrix \(\varvec = \left[\varvec^}\ \ \varvec^}\right]\).

Step 3: In the last step, we test for association between the adjusted SNP and the squared residuals and cross products (SQ/CP) using a kernel-based distance covariance framework [35,36,37]. Specifically, we apply a kernel-based independence test called the Hilbert-Schmidt independence criterion (HSIC), which has been previously used for GWAS data (see, e.g., [38,39,40,41]). The HSIC generalizes many well-known testing procedures in statistics; for example, depending on the kernel function choice, the RV coefficient [42], distance covariance [43], and multivariate distance matrix regression (MDMR) [44] can be expressed as special cases of the HSIC. Due to such flexibility, it is implemented in many testing procedures in genetics (e.g., SKAT [40]). The HSIC constructs two \(n\times n\) similarity matrices between individuals using the SQ/CP matrix and genotype matrix, then calculates a test statistic that measures any shared signal between these similarity matrices. To estimate the similarity matrix, a kernel function is specified that captures the similitude between the jth and \(j^\)th individual.

Since our primary application is biobank-sized data, we use a linear kernel so that LIT is computationally efficient. The linear similarity matrix is defined as \(K_} := k\left( \widetilde_, \widetilde_}\right) = \widetilde_ \widetilde_}\) for the genotype matrix and \(L_}:= k\left( \varvec_, \varvec_}\right) = \varvec_ \varvec_}^\) for the SQ/CP matrix. The linear kernel is a scaled version of the covariance matrix and, for this special case, the HSIC is related to the RV coefficient. While our theoretical results indicate that the variance (Eq. 2) and covariance (Eq. 3) models include a quadratic term for the genotypes, the expected effect size of an interaction in a GWAS suggests that the linear term will dominant the variance and/or covariance signal compared to the quadratic term. Therefore, we only consider the linear term in this work. We note that one can choose other options for a kernel function, such as a polynomial kernel, projection kernel, and a Gaussian radial-basis function that can capture non-linear relationships [41, 45].

Once the similarity matrices \(\varvec\) and \(\varvec\) are constructed, we can express the HSIC test statistic as

$$\begin T = \frac \textrm(} }), \end$$

(5)

which follows a weighted sum of chi-squared random variables under the null hypothesis, i.e., \(T\mid H_ \sim \sum _^ \frac \lambda _ \lambda _ v_^\), where \(\lambda _\) and \(\lambda _\) are the ordered non-zero eigenvalues of the respective matrices and \(v_ \sim \textrm(0,1)\). Intuitively, the test statistic measures the "overlap" between two random matrices where large values of T imply the two matrices are similar (i.e., a latent genetic interactive effect) while small values of T imply no evidence of similarity (i.e., no latent genetic interactive effects). We can approximate the null distribution of T using Davies’ method, which is computationally fast and accurate for large T [40, 41, 46].

For the linear kernel considered here, we implement a simple strategy to substantially improve the computational speed of LIT. We first calculate the eigenvectors and eigenvalues of the SQ/CP and genotype matrices to construct the test statistic. Since the number of traits, r, is much smaller than the sample size, n, we can perform a singular value decomposition to estimate the subset of eigenvectors and eigenvalues in a computationally efficient manner [47,48,49]. This allows us to circumvent direct calculation and storage of large \(n\times n\) similarity matrices. Let \(\varvec = \varvec_ \varvec_ \varvec_^\) and \(\varvec = \varvec_ \varvec_ \varvec_^\) be the singular value decomposition (SVD) of the similarity matrices where the matrix \(\varvec\) is a diagonal matrix of eigenvalues and \(\varvec\) is a matrix of eigenvectors of the respective kernel matrices. We can then express the test statistic in terms of the SVD components as \(T = \frac}\left( \varvec_ \varvec\varvec_\varvec^\right)\), where \(\varvec = \varvec_^ \varvec_\) is the outer product between the two eigenvectors. Thus, for a single SNP, the test statistic is \(T = \frac\text\left( \varvec_ \varvec_\times d_} \varvec_ \varvec^_\times d_}\right)\), where \(d_= r+s\) is the rank of the SQ/CP matrix, \(d_ = 1\) is the rank of the genotype matrix, and  \(d_, d_ \ll n\).

Aggregating different LIT implementations using the Cauchy combination test

We explore an important aspect of the test statistic in Eq. 5, namely, the role of the eigenvalues in determining statistical significance. The above equations suggest that the eigenvalues of the kernel matrices are emphasizing the eigenvectors that explain the most variation in the test statistic. While this may be reasonable in some settings, the interaction signal can be captured by eigenvectors that explain the least variation and this can be very difficult to ascertain beforehand [50]. In this case, the testing procedure will be underpowered. Thus, we also consider weighting the eigenvectors equally in LIT, i.e., \(T = \frac\text\left(\varvec\varvec^\right) = \frac\sum\nolimits_^D^_\), where \(D_\) are the eigenvalues of the outer product matrix. In this work, we implement a linear kernel (scaled covariance matrix) and so, in this special case, weighting the eigenvectors equally is equivalent to the projection kernel.

In summary, there are two implementations of the LIT framework. The residuals are first transformed to calculate the SQ and CP to reveal any latent interactive effects. We then calculate the weighted and unweighted eigenvectors in the test statistic which we refer to as weighted LIT (wLIT) and unweighted LIT (uLIT), respectively. We also apply a Cauchy combination test (CCT) [51] to combine the p-values from the LIT implementations to maximize the number of discoveries and hedge for various (unknown) settings where one implementation may outperform the other. More specifically, let \(p_\) denote the p-value for the \(c=1,2\) implementations. In this case, the CCT statistic is \(T^ = \frac \sum _^ \tan \left\ )\pi \right\}\), where \(\pi \approx 3.14\) is a mathematical constant. A corresponding p-value is then calculated using the standard Cauchy distribution. Importantly, when applying genome-wide significance levels, the CCT controls the type I error rate under arbitrary dependence structures. We refer to the CCT p-value as aggregate LIT (aLIT).

Incorporating multiple loci in LIT

We can extend LIT to assess latent interactions within a genetic region (e.g., a gene) consisting of multiple SNPs. In the first step, we regress out the joint additive effects from the multiple SNPs along with any other covariates and population structure. In the second step, we calculate the squared residuals and cross products using the corresponding residual matrix. Finally, in the last step, we construct the similarity matrices and perform inference using the HSIC: the linear similarity matrix for the \(n \times m_\) genotype matrix \(\widetilde}\) is \(K_} = k\left( \widetilde}_, \widetilde}_}\right) = \widetilde}_ \widetilde}_}^\) and our test statistic is \(T = \frac\text\left( \varvec_ \varvec_\times d_} \varvec_ \varvec^_\times d_}\right)\) where \(d_ = m_\) is the rank of the genotype matrix.

Compared to the previous section, this extended version of LIT is a region-based test for interactive effects instead of a SNP-by-SNP test. A region-based test is advantageous to reduce the number of tests compared to a SNP-by-SNP approach and enable testing of rare variants [41]. However, in this work, we perform a SNP-by-SNP genome-wide scan with LIT to demonstrate the scalability.

Simulation study

We evaluated the performance of LIT using simulated data with the following assumptions. Let the individual-specific minor allele frequencies of \(t = 1, 2, \ldots , m\) biallelic genotypes be denoted by \(\pi _\). Of the m SNPs, \(m-1\) SNPs had no interacting partner and a minor allele frequency drawn from a \(\textrm(0.1,0.4)\). The SNP with an interacting partner had a minor allele frequency of 0.25. We fixed this MAF to remove stochastic variation in the observed power induced by simulations differing only by the MAF of the interacting SNP. The genotypes were then drawn from a Binomial distribution with parameter \(\pi _\), i.e., \(X_ \sim Binomial(2,\pi_)\). In total, there were \(n=300000\) individuals simulated to reflect biobank-sized GWAS.

We simulated the trait expression value \(Y_\) for \(k = 1,2,\ldots ,r\) traits under the polygenic trait model with two risk environmental variables \(M_\) and \(W_\). Specifically, there were \(r = 5, 10\) traits and \(m=100\) genotypes simulated with an additive genetic, environmental, and GxE components:

$$\begin Y_ = \alpha _ + \beta _ X_ + \phi _ M_ + \gamma _ M_ X_ + \sum \limits _^ \beta _ X_ + W_ + \epsilon _, \end$$

(6)

where the intercept, \(\alpha _\), follows a normal distribution with a standard deviation of 5; the effect sizes of the GxE interaction, \(\gamma _\), interacting environment, \(\phi _\), and additive genetic component, \(\beta _\), follow a normal distribution with mean zero and standard deviation of 0.01; the two environmental variables were generated from a standard normal distribution where only one interacts with the risk allele; and the error term was generated from a standard normal distribution. Using the above model, we considered different types of pleiotropy. First, we assigned the effect size direction of the additive genetic component, interacting environment, and the GxE interaction to be the same in each trait. We then considered cases where the effect size for the shared GxE interaction is in the same direction (i.e., \(|\gamma _|\)) and random directions across traits. These settings represent positive pleiotropy and a mixture of positive and negative pleiotropy, respectively. We also considered a variation of the above settings where the direction of the effect size for the GxE interaction is opposite of the interacting environment.

We transformed the components in the model using the function \(f(x) = \dfrac_}_}\), which takes a vector x and standardizes it by the estimated mean and standard deviation. We scaled each component to set the baseline correlation between traits (ignoring the risk factor, interactive environment, and GxE interaction) as 0.25, 0.50, and 0.75. In particular, the percent variance explained of the non-interactive environment was \(15\%\) and the additive genetic component (minus the risk factor) was \(10\%\), \(35\%\), and \(60\%\), which represents a 0.25, 0.50, and 0.75 baseline correlation between traits, respectively. We then assigned the percent variance explained for the additive genetic risk factor as \(0.2\%\), the interactive environment as a uniformly drawn value from \(0.5\) to \(2.0\%\), the GxE interaction as a uniformly drawn value from \(0.1\) to \(0.15\%\), and the remaining variation as noise.

In our simulation study, we also varied the proportion of traits with an interaction term. For r traits, let \(\tau _\) denote the proportion of traits with a shared GxE interaction signal. We varied this proportion as \(\tau _ = \frac, \frac,\ldots , 1\). At each combination of baseline trait correlation, number of traits, and proportion of null traits, we generated data from the above polygenic trait model 500 times for each pleiotropy setting. We calculated the empirical power by averaging the total number of times the p-values were below a significance threshold of \(\alpha = 5 \times 10^\). Under the null hypothesis of no GxE interaction, we assessed the type I error rate at \(\alpha = 1 \times 10^\) using 50 simulated datasets with 10,000 SNPs where the traits do not have a GxE interaction. We also considered cases where the random error follows a Chi-squared distribution with five degrees of freedom and a t-distribution with three degrees of freedom under the null hypothesis.

UK Biobank

The UK Biobank is a collaborative research effort to gather environmental and genetic information from half a million volunteers 40–69 years old in the UK. The data was collected across 22 assessment centers from 2006 to 2010 where participants were given a general lifestyle and health questionnaire, a physical examination, and a blood test that provided genetic data [52, 53]. See ref. [54, 55] for detailed information on the study design.

We applied LIT to four obesity-related traits, namely, waist circumference, hip circumference, body mass index, and body fat percentage. We restricted our analysis to unrelated individuals with British ancestry and removed any individuals with a sex chromosome aneuploidy. Using the imputed genotypes (autosomes only), SNPs were filtered in PLINK [56] with the following thresholds: a MAF of \(0.05\), a genotype missingness rate of \(0.05\), Hardy-Weinberg equilibrium (defined as \(10^\)), and an INFO score of \(0.9\). The traits were adjusted for age and the top 20 principal components provided by the UK Biobank to account for ancestry. We removed individuals with measurements that were four standard deviations above the average and then standardized the traits by sex. After filtering, there were 329,146 individuals and 6,186,503 SNPs in our analysis. To calculate the genomic inflation factor of our LIT analyses, we identified a subset of 34,643 “independent” SNPs using the argument —index-pairwise 500 5 0.05 in PLINK.

To demonstrate our procedure controls the type I error rate using the UK Biobank data, we implemented the double Kolmogorov-Smirnov (KS) testing framework as proposed by Leek and Storey (2011) [57] as a diagnostic tool. The double KS test is implemented as follows. For 100 permutations of the phenotypes, we apply the LIT framework to calculate the p-values for all of the “independent” SNPs (used to estimate the genomic inflation factor) under the null hypothesis. We then evaluate the joint behavior of the set of p-values within each permuted dataset by applying a KS test to assess how “close” the observed p-values are to a uniform distribution. Under the null hypothesis for a well-behaved testing procedure, the p-values of this first KS test follow a uniform distribution across the 100 permutations. The next step is to implement a second KS test on the 100 KS test p-values to test how “close” the set of 100 KS test p-values are to a uniform distribution. Under the null hypothesis, the joint behavior of the KS test p-values across the permutations will also follow a uniform distribution. The double KS test simultaneously assesses the joint behavior of the p-values within each permutation and the marginal p-values across permutations.

留言 (0)

沒有登入
gif