The X factor: A robust and powerful approach to X‐chromosome‐inclusive whole‐genome association studies

1 INTRODUCTION

Genome-wide association studies (GWAS) are ubiquitous, delivering significant insights into the genetic determinants of complex traits over the past decade (Visscher et al., 2017). For this reason, it is surprising that it is not a common practice to include the X-chromosome in GWAS (Konig et al., 2014; Wise et al., 2013). The X-chromosome differs from the autosomes in that males have only one copy of the X-chromosome while females have two, and at any given genomic location one of the two copies in females may be silenced (Gendrel & Heard, 2011), referred to as X-chromosome inactivation (XCI). The choice of the silenced copy could be random or skewed toward a specific copy (Wang et al., 2014). These unique aspects lead to more complex analytic considerations for genetic association analysis of X-chromosomal variants, such as bi-allelic single-nucleotide polymorphisms (SNPs).

A bi-allelic SNP has two alleles, $r$ and $R$, of which one is the reference allele and the other is the alternative allele with a given allele frequency. An autosomal SNP has three genotypes regardless of sex, namely $G = (rr, rR, RR)$. In association analysis of an autosomal SNP, the common practice is to simply model a binary or continuous phenotype $Y$ as an additive function of the number of copies of the non-baseline allele present in $G$; that is, coding $G$ additively as $G_A = (0, 1, 2)$. Here, without loss of generality, $r$ is chosen to be the baseline allele in a statistical model and $R$ the non-baseline allele. When $Y$ is binary, this regression-based additive test is also equivalent to the Cochran–Armitage trend test (Wellek & Ziegler, 2012). Although both dominant and recessive genetic models of inheritance are possible, among these one-degree-of-freedom (1 df) models a common practice for GWAS is to use the additive model, because it has reasonable power to detect both additive and dominance effects at a causal variant, and at variants in linkage disequilibrium (LD) with the causal variant (Bush & Moore, 2012; Hill et al., 2008). An alternative parameterization is the 2 df genotypic model that includes both the additive $G_A$ term and the dominance $G_D = (0, 1, 0)$ term. In the case of recessive genetic inheritance, Zhou et al. (2017) showed that the 2 df genotypic test outperforms the 1 df additive test for binary outcomes, and Dizier et al. (2017) reached the same conclusion for continuous traits. When the true genetic inheritance is additive, the genotypic test is known to be less powerful than the additive test because of the additional, unnecessary df. However, it is not clear which test is preferable in terms of power and robustness when the genetic inheritance is unknown, across different true genetic effect sizes, sample sizes, and significance levels.

For an X-chromosomal SNP, the most commonly used approach assumes additivity and XCI. However, recent work (Tukiainen et al., 2017) showed that up to one-third of X-chromosomal genes are expressed from both the active and inactive X-chromosomes in female cells, with varying degrees of “escape” from inactivation between genes and individuals. Several additional points also require attention. Table 1 describes eight analytical considerations and challenges (C1–C8) present in an X-chromosome-inclusive GWAS, including a method's suitability for analyzing both binary and continuous traits (C1), which is related to the type of method used, that is, genotype-based or allelic association tests (C2); the (under-appreciated) consequence of the choice of the baseline allele on association analysis of an X-chromosomal SNP (C3); the importance of including sex as a covariate (C4) and its analytical connection with C3; the value in considering gene-sex interaction effect (C5) and its connection with the assumption of XCI (C6); and the assumption of random versus skewed XCI (C7) and its connection with nonadditive effects (C8).

Table 1. Eight analytical considerations and challenges, C1–C8, present in X-chromosome-inclusive association studies

C1: Quantitative traits vs. binary outcomes
C2: Genotype-based vs. allele-based association methods
Problem: Allele-based association tests, comparing allele frequency differences between cases and controls, are locally most powerful. However, they analyze binary outcomes only and are sensitive to the Hardy–Weinberg equilibrium (HWE) assumption (Sasieni, 1997).
Solution: Genotype-based regression models, $Y$-on-$G$, support various types of outcome data, account for covariate effects with ease, and are robust to the HWE assumption.
Relevant sections: Sections 1 and 2

C3: The choice of the baseline allele for association analysis, $r$ vs. $R$
Problem: For the autosomes, switching the two alleles does not affect the association inference. Is this true for the X-chromosome?
Solution: It is not always true for the X-chromosome, unless $S$ is included in the model.
Relevant sections: Sections 2.1 and 2.2, and C4

C4: Sex as a covariate vs. no $S$ main effect
Problem: Unlike the autosomes, sex is a confounder when analyzing the X-chromosome for traits exhibiting sexual dimorphism (e.g., height and weight). Even for the autosomes, sex can be a confounder if allele frequencies differ significantly between males and females.
Solution: To maintain correct type I error rate control, the sex main effect must be considered, particularly when analyzing the X-chromosome. The resulting association test is also invariant to the choice of the baseline allele.
Relevant sections: Section 2.2 and C3

C5: Gene–sex interaction vs. no $G \times S$ interaction effect
Problem: Gene–sex interaction might exist, but there is a concern over loss of power due to increased degrees of freedom. In addition, what is the interpretation of a gene–sex interaction effect in the presence of X-inactivation?
Solution: Under no interaction, the power loss from modeling the interaction is capped at 11.4%. Models including the $GS$ covariate also lead to tests invariant to the assumption of X-chromosome inactivation status.
Relevant sections: Sections 2.3 and 3, and C6

C6: X-chromosome inactivation (XCI) vs. no XCI
Problem: XCI occurs if one of the two alleles in a genotype of a female is silenced. Individual-level XCI status requires additional biological information that is not typically available to genetic association studies. Assuming XCI or no XCI at the sample level leads to different genotype coding strategies (Table 2), and it was thought that this would always lead to different association results.
Solution: XCI uncertainty implies a sex-stratified genetic effect, which can be analytically represented by the $GS$ interaction effect. Teasing apart these different biological phenomena requires other "omic" data and additional analyses.
Relevant sections: Sections 2.3 and 5, and C5

C7: If XCI, random vs. skewed X-inactivation
Problem: If the choice of the silenced allele in females is skewed toward a specific allele, the average effect of the $rR$ genotype is no longer the average of those of $rr$ and $RR$.
Solution: XCI skewness is statistically equivalent to a dominance genetic effect.
Relevant sections: Section 2.4, and C8

C8: Dominance effect vs. no $G_D$ dominance effect
Problem: For both the autosomes and the X-chromosome, the most common practice is to use the additive test, which has better power than the genotypic test under (approximate) additivity but cannot capture dominance effects. The exact trade-off, however, is not clear.
Solution: We provide analytical and empirical evidence supporting the use of the genotypic model when analyzing either the autosomes or the X-chromosome. For an X-chromosomal variant, including the dominance effect term has the added benefit of resolving the skewed X-inactivation uncertainty issue.
Relevant sections: Sections 2.4, 3 and 4, and C7

Table 2. Covariate coding schemes for examining the additive, dominance, gene–sex interaction, and sex effects under different assumptions of the X-chromosome inactivation status and the choice of the baseline allele

| Effect interpretation | Covariate notation | Non-baseline allele | XCI status | Females: $rr$ | Females: $rR$ | Females: $RR$ | Males: $r$ | Males: $R$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Additive ($G_A$) | $G_{A,R,I}$ | $R$ | Yes | 0 | 0.5 | 1 | 0 | 1 |
| Additive ($G_A$) | $G_{A,r,I}$ | $r$ | Yes | 1 | 0.5 | 0 | 1 | 0 |
| Additive ($G_A$) | $G_{A,R,N}$ | $R$ | No | 0 | 1 | 2 | 0 | 1 |
| Additive ($G_A$) | $G_{A,r,N}$ | $r$ | No | 2 | 1 | 0 | 1 | 0 |
| Dominance | $G_D$ | Either | Either | 0 | 1 | 0 | 0 | 0 |
| Gene–sex interaction ($GS$) | $GS_R$ | $R$ | Either | 0 | 0 | 0 | 0 | 1 |
| Gene–sex interaction ($GS$) | $GS_r$ | $r$ | Either | 0 | 0 | 0 | 1 | 0 |
| Sex | $S$ | Either | Either | 0 | 0 | 0 | 1 | 1 |

Note: The subscripts $A$ and $D$ represent additive and dominance effects, $R$ or $r$ represents the non-baseline allele of which we count the number of copies present in a genotype, and $I$ or $N$ denotes X-chromosome inactivated or not inactivated.
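As a reading aid, the coding schemes of Table 2 can be written down as a small lookup table. The sketch below is illustrative only: the dictionary keys (`GA_R_I`, `GS_r`, and so on) and the helper `encode` are hypothetical names introduced here, not notation or software from the paper. The three checks at the end are simple arithmetic on the tabulated values.

```python
# Genotype groups in the column order of Table 2:
# three female genotypes followed by two male genotypes.
GENOTYPES = ("rr", "rR", "RR", "r", "R")

# Covariate codings copied from Table 2 (key names are illustrative labels).
CODING = {
    "GA_R_I": (0, 0.5, 1, 0, 1),  # additive, count R, XCI
    "GA_r_I": (1, 0.5, 0, 1, 0),  # additive, count r, XCI
    "GA_R_N": (0, 1, 2, 0, 1),    # additive, count R, no XCI
    "GA_r_N": (2, 1, 0, 1, 0),    # additive, count r, no XCI
    "GD":     (0, 1, 0, 0, 0),    # dominance
    "GS_R":   (0, 0, 0, 0, 1),    # gene-sex interaction, count R
    "GS_r":   (0, 0, 0, 1, 0),    # gene-sex interaction, count r
    "S":      (0, 0, 0, 1, 1),    # sex: female = 0, male = 1
}


def encode(genotypes, scheme):
    """Map a sequence of genotype labels to covariate values under one scheme."""
    lookup = dict(zip(GENOTYPES, CODING[scheme]))
    return [lookup[g] for g in genotypes]


# Under XCI, switching the baseline allele is the linear map G -> 1 - G.
xci_switch_is_linear = all(
    a == 1 - b for a, b in zip(CODING["GA_r_I"], CODING["GA_R_I"])
)

# Under no XCI, the switch needs sex as well: G -> 2 - S - G.
no_xci_switch_needs_sex = all(
    a == 2 - s - b
    for a, s, b in zip(CODING["GA_r_N"], CODING["S"], CODING["GA_R_N"])
)

# The two interaction codings differ by the sex indicator: GS_r = S - GS_R.
gs_switch_needs_sex = all(
    a == s - b for a, s, b in zip(CODING["GS_r"], CODING["S"], CODING["GS_R"])
)
```

For example, `encode(["rR", "R"], "GA_R_N")` places a heterozygous female and an $R$-carrying male at the same covariate value, which is the grouping discussed in the text for the no-XCI coding.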

Several association methods have been developed for the X-chromosome, and they are computationally efficient for conducting X-chromosome-wide association analysis. However, each method solves only some of the C1–C8 challenges. For example, Zheng et al. (2007) considered only binary outcomes for which both genotype- and allele-based association tests are applicable. The classical allelic association test, comparing allele frequencies between case and control groups, is locally most powerful but sensitive to the Hardy–Weinberg equilibrium (HWE) assumption and not applicable to continuous traits (Sasieni, 1997; Zhang & Sun, 2021; Zheng, 2008). Clayton (2008, 2009) discussed analytical strategies assuming the X-chromosome is always inactivated. Hickey and Bahlo (2011) and Loley et al. (2011) performed simulation studies, each providing a thorough method comparison, for example, between tests of Zheng et al. (2007) and Clayton (2008). Konig et al. (2014) gave detailed guidelines for including the X-chromosome in GWAS, recommending different tests for different model assumptions (e.g., presence or absence of an interaction effect or XCI), but it is difficult to validate these assumptions in practice. Gao et al. (2015) developed a toolset for conducting X-chromosome association studies, implementing some of the existing methods. More recently, Z. Chen et al. (2017) improved sex-stratified analysis by eliminating genetic model assumptions, but their method is limited to analyzing genetic main effects on binary traits. Focusing on XCI uncertainty, Wang et al. (2014) proposed a frequentist maximum likelihood solution to deal with no, random or skewed X-inactivation, and in their follow-up work Wang et al. (2017) provided a model selection method. In contrast, B. Chen et al. (2020) applied the Bayesian model averaging principle (Draper, 1995) to deal with the XCI uncertainty problem. However, these approaches assumed additive genetic effects. The value in considering dominance and gene–sex interaction effects, and the inferential consequence of choosing a different baseline allele (i.e., the reference or the alternative allele) when analyzing an X-chromosomal SNP, have received little to no attention.

Here we propose a theoretically justified and robust X-chromosome association method that can simultaneously deal with all eight challenges (C1–C8) outlined in Table 1. We emphasize the robustness of the proposed method to genetic assumptions as our understanding is evolving. For example, although most published X-chromosome-inclusive GWAS assumed XCI, recent work has shown that up to a third of genes “escape” XCI (Tukiainen et al., 2017).

The proposed method is regression- and genotype-based (robust to departure from HWE), analyzing either a continuous or binary trait while adjusting for covariate effects. The recommended test has three degrees of freedom, including both additive and dominance genetic effects, as well as a gene–sex interaction effect. We show analytically why the proposed method is robust to the various model uncertainties, including no, random or skewed XCI, as well as the choice of the baseline allele. Desirably, the power of the proposed test is robust to different alternative genetic models, despite its increased degrees of freedom over a simple additive test. We note that the work here focuses on efficient association testing, not on parameter estimation or model selection, which require additional biological data (Busque et al., 1996).

We first present our main theory to address the eight challenges associated with X-chromosome-inclusive GWAS in Section 2. We then provide analytical results of a power study across all possible genetic models, sample sizes and type I error rates, as well as empirical results from simulation studies, in Section 3. For methodological completeness, this section also briefly discusses the merit of the genotypic model in the familiar context of analyzing autosomal SNPs. We then provide corroborating evidence from several applications in favor of the proposed approach in Section 4. Finally, we discuss the limitations of our approach and possible future work in Section 5.

2 METHOD FOR X-CHROMOSOME-INCLUSIVE ASSOCIATION ANALYSIS

The proposed method relies on the generalized linear model (McCullagh & Nelder, 1989) as it is flexible, analyzing both binary and continuous traits (C1 of Table 1). As a result, the method is a genotype-based approach (C2) that is robust to the assumption of HWE by regressing the phenotype data ($Y$) on genetic data ($G$) while accounting for other covariate effects.

For robust and powerful association analysis of a bi-allelic X-chromosomal SNP, we recommend the following model:

$$g(\mathrm{E}(Y)) = \beta_0 + \beta_S S + \beta_A G_A + \beta_D G_D + \beta_{GS} GS, \tag{1}$$

and the corresponding 3 df test, jointly testing

$$H_0: \beta_A = \beta_D = \beta_{GS} = 0, \tag{2}$$

where notations for the covariates are defined in Table 2. Other relevant covariates such as environmental factors ($E$s) should also be included in the model but are omitted here for notational simplicity.
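To make the recommended 3 df test concrete, the sketch below fits the model with sex, additive, dominance and gene–sex interaction covariates to simulated data and computes a joint test statistic for the three genetic terms. It rests on several assumptions that are not from the paper: a continuous trait with identity link fitted by ordinary least squares, an F statistic from nested residual sums of squares standing in for the Wald, Score or likelihood ratio tests, and hypothetical simulation settings (sample size, allele frequency, effect sizes). The function names `rss` and `joint_f_stat` are illustrative.

```python
import numpy as np


def rss(y, X):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def joint_f_stat(y, X_null, X_test):
    """F statistic for jointly testing the columns of X_test, adjusting for X_null."""
    X_full = np.column_stack([X_null, X_test])
    q = X_test.shape[1]
    df_resid = len(y) - np.linalg.matrix_rank(X_full)
    rss_null, rss_full = rss(y, X_null), rss(y, X_full)
    return ((rss_null - rss_full) / q) / (rss_full / df_resid)


rng = np.random.default_rng(0)
n = 2000
freq_R = 0.3                      # hypothetical frequency of allele R
sex = rng.integers(0, 2, n)       # female = 0, male = 1, as in Table 2

# Number of copies of R: two draws for females, one for males.
copies = np.where(sex == 0, rng.binomial(2, freq_R, n), rng.binomial(1, freq_R, n))

# Covariates coded as in Table 2, counting R and assuming XCI for G_A.
G_A = np.where(sex == 0, copies / 2, copies).astype(float)
G_D = ((sex == 0) & (copies == 1)).astype(float)
GS = ((sex == 1) & (copies == 1)).astype(float)

# Hypothetical trait: a sex effect plus an additive genetic effect.
y = 0.5 * sex + 0.3 * G_A + rng.normal(size=n)

X_null = np.column_stack([np.ones(n), sex])   # intercept and sex main effect
X_test = np.column_stack([G_A, G_D, GS])      # the three tested genetic terms
F_3df = joint_f_stat(y, X_null, X_test)
```

With an intercept, sex, and the three genetic covariates, the design has five columns for five genotype groups, so the fitted model is saturated in the genotype-by-sex groups. A p-value would additionally require an F distribution function, which is left out here to keep the sketch dependent on NumPy only.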

We show later (a) why the association result from the proposed approach is invariant to the different $G_A$ (e.g., $G_{A,R,I}$ or $G_{A,r,N}$) and $GS$ (e.g., $GS_R$ or $GS_r$) coding schemes as defined in Table 2, and (b) why the proposed method also solves the C3–C8 issues simultaneously. But before we do so, we first provide more details about the notations presented in Table 2.

2.1 X-chromosome specific genotype and covariate coding schemes

Table 2 summarizes the various covariate coding schemes for analyzing an X-chromosomal SNP, when considering all the analytical challenges outlined in Table 1. Note that when the choice of the baseline allele is varied (i.e., either $r$ or $R$) and the XCI status is unknown, there are four ways to code the additive covariate $G_A$, and two ways to code the gene–sex interaction covariate $GS$. The specific coding for sex does not have an impact on our proposed method. In Table 2, a female is coded as 0 and a male as 1, and the interaction $G_D \times S$ term vanishes. If a female were coded as 1 and a male as 0, then $G_D \times S$ is the same as $G_D$. Thus, in either case it is redundant to include $G_D \times S$ in our proposed regression model.
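The redundancy of a dominance-by-sex product term can be verified directly from the Table 2 codings. The short check below is only an arithmetic illustration; the variable names are hypothetical.

```python
# Codings over the five genotype groups (rr, rR, RR, r, R), as in Table 2.
G_D = (0, 1, 0, 0, 0)                          # dominance: 1 only for female rR
S_male_1 = (0, 0, 0, 1, 1)                     # female = 0, male = 1
S_female_1 = tuple(1 - s for s in S_male_1)    # female = 1, male = 0

# Element-wise product of the dominance and sex covariates under each sex coding.
product_male_1 = tuple(g * s for g, s in zip(G_D, S_male_1))
product_female_1 = tuple(g * s for g, s in zip(G_D, S_female_1))

term_vanishes = all(v == 0 for v in product_male_1)   # product is identically zero
term_duplicates_GD = product_female_1 == G_D          # product equals G_D itself
```

In the first coding the product column is all zeros, and in the second it is an exact copy of the dominance column, so in neither case does it add information to the design matrix.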

Using the notations in Table 2, it is immediately clear why the choice of the baseline allele (C3) matters for association analysis of an X-chromosomal SNP. Under no XCI, if $r$ were assumed to be the baseline allele there would be one copy of allele $R$ in genotype $rR$ of a female, and $R$ of a male. Thus, genotypes $rR$ and $R$ would be grouped together for association analysis. However, if $R$ were chosen to be the baseline allele, genotypes $rR$ and $r$ would be grouped together, resulting in different inference. In contrast, the choice of the baseline allele does not affect association evidence when analyzing an autosomal SNP. It is well-known that although the estimate of the effect size changes direction, the magnitude of the association remains the same when analyzing an autosomal SNP. But this is not always true when analyzing an X-chromosomal SNP.

2.2 Sex as a confounder (C4) and its connection with the choice of the baseline allele (C3)

Sex is a confounder for phenotype-genotype association analysis of an X-chromosomal SNP for traits displaying sexual dimorphism. When sex, but not the SNP, is associated with a trait of interest, omitting sex in the analysis leads to false positives. This is because sex is inherently associated with the genotypes of an X-chromosomal SNP (Table 2); see Ozbek et al. (2018) for empirical evidence from simulation studies. Thus, accuracy of a test provides the first argument for always including $S$ as a covariate in association analysis of an X-chromosomal SNP.

The second advantage of modeling the $S$ main effect is more subtle. As shown in Table 2, the coding of $G_A$ depends on the choice of the baseline allele (i.e., $r$ or $R$) and the X-inactivation status ($I$ for XCI and $N$ for no XCI), resulting in a total of four different ways of coding the five genotype groups, namely $G_{A,R,I} = (0, 0.5, 1, 0, 1)'$, $G_{A,r,I} = (1, 0.5, 0, 1, 0)'$, $G_{A,R,N} = (0, 1, 2, 0, 1)'$, and $G_{A,r,N} = (2, 1, 0, 1, 0)'$. Furthermore, $G_{A,R,N}$ and $G_{A,r,N}$ yield different test statistics, because the two coding schemes lead to different groupings of the genotypes as discussed in Section 2.1. Note that, in contrast to $G_{A,r,I} = 1 - G_{A,R,I}$ under XCI, under no XCI there is no linear transformation that makes $G_{A,R,N}$ and $G_{A,r,N}$ equivalent. An inference that is invariant to the coding choices may seem difficult, but we show that this is achievable for models that include sex as a covariate.
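The effect of including sex on this invariance can be checked numerically. The sketch below compares the 1 df test of the additive covariate under the two no-XCI codings, with and without a sex main effect. As assumptions not taken from the paper, it uses a continuous trait fitted by ordinary least squares, an F statistic in place of the Wald, Score or likelihood ratio tests, and hypothetical simulation settings in which the trait depends on sex only.

```python
import numpy as np


def rss(y, X):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def f_stat(y, X_null, X_test):
    """F statistic for testing the columns of X_test, adjusting for X_null."""
    X_full = np.column_stack([X_null, X_test])
    q = X_test.shape[1]
    df_resid = len(y) - np.linalg.matrix_rank(X_full)
    rss_null, rss_full = rss(y, X_null), rss(y, X_full)
    return ((rss_null - rss_full) / q) / (rss_full / df_resid)


rng = np.random.default_rng(1)
n = 1000
sex = rng.integers(0, 2, n)       # female = 0, male = 1
copies_R = np.where(sex == 0, rng.binomial(2, 0.3, n), rng.binomial(1, 0.3, n))

# The two no-XCI additive codings of Table 2.
GA_R_N = copies_R.astype(float)                                        # (0, 1, 2, 0, 1)
GA_r_N = np.where(sex == 0, 2 - copies_R, 1 - copies_R).astype(float)  # (2, 1, 0, 1, 0)

# Hypothetical sexually dimorphic trait with no SNP effect.
y = 0.8 * sex + rng.normal(size=n)

intercept_only = np.ones((n, 1))
intercept_and_sex = np.column_stack([np.ones(n), sex])

F_R_without_sex = f_stat(y, intercept_only, GA_R_N[:, None])
F_r_without_sex = f_stat(y, intercept_only, GA_r_N[:, None])
F_R_with_sex = f_stat(y, intercept_and_sex, GA_R_N[:, None])
F_r_with_sex = f_stat(y, intercept_and_sex, GA_r_N[:, None])

same_with_sex = bool(np.isclose(F_R_with_sex, F_r_with_sex))
same_without_sex = bool(np.isclose(F_R_without_sex, F_r_without_sex))
```

The two statistics agree once sex is in the model because $G_{A,r,N} = 2 - S - G_{A,R,N}$, so the two designs span the same column space and share the same null model; without sex, no such linear relation exists and the statistics differ.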

Theorem 1. Let $M_1$ and $M_2$ be two generalized linear models (McCullagh & Nelder, 1989) with the same link function $g$, namely $g(\mathrm{E}(Y)) = X_1\beta_1$ and $g(\mathrm{E}(Y)) = X_2\beta_2$, where $Y$ is the response vector of length $n$ and $X_1, X_2$ are two $n \times p$ design matrices, and $\beta_1$ and $\beta_2$ are the corresponding parameter vectors of length $p$. Let $X_1 = (X_{11}, X_{12})$, where $X_{11}$ and $X_{12}$ are $n \times (p-q)$ and $n \times q$ matrices corresponding to, respectively, the $p-q$ secondary covariates not being tested and the $q$ primary covariates of interest, and similarly for $X_2 = (X_{21}, X_{22})$, and partition the regression coefficients accordingly as $\beta_1 = (\beta_{11}', \beta_{12}')'$ and $\beta_2 = (\beta_{21}', \beta_{22}')'$. If there exists an invertible $p \times p$ matrix

$$T = \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix} \quad \text{such that} \quad X_2 = X_1 T,$$

where $T_{11}$ and $T_{22}$ are, respectively, invertible $(p-q) \times (p-q)$ and $q \times q$ matrices, then any of the Wald, Score or LRT tests for testing $H_0: \beta_{12} = 0$ and $H_0: \beta_{22} = 0$ are identical under the two models $M_1$ and $M_2$, resulting in the same association inference for evaluating the
