Testing and estimation of X‐chromosome SNP effects: Impact of model assumptions

1 INTRODUCTION

X chromosome inactivation (XCI) is an epigenetic process that leads to inactivation of one of the two X chromosomes in females (Ross et al., 2005). It prevents females from having twice as much X chromosome gene expression as males, who only have one X chromosome. Generally, XCI is believed to be random, but research has shown that XCI can be skewed in some individuals, meaning that one chromosome is more likely to be silenced than the other (Heard & Disteche, 2006). Also, approximately 9%–14% of X-linked genes escape from XCI and both copies are expressed (Carrel & Willard, 1999). These complexities have implications for modeling X chromosome variant effects on phenotypes and have resulted in the X chromosome often being excluded from genome-wide association study (GWAS) analyses (Wise et al., 2013).

Sex differences in prevalence or clinical presentation have been observed in many diseases including autoimmune disorders, cardiovascular diseases, cancer, and psychiatric disorders (Appelman et al., 2015; Dong et al., 2020; Erol et al., 2015; Riecher-Rössler, 2017; Voskuhl, 2011), leading to hypotheses that X chromosome genetic variation may contribute to some of these differences. Therefore, interest in analyzing X chromosome variants is growing and researchers are beginning to incorporate X chromosome variants into GWAS (Chang et al., 2014; Khramtsova et al., 2019; MacArthur et al., 2017; Stahl et al., 2019). However, often, little attention is paid to XCI biology when analyzing X chromosome genotype data (Chen et al., 2020; Wang et al., 2017). A number of approaches have been proposed to analyze single nucleotide polymorphisms (SNPs) on the X chromosome, and several studies have compared different approaches (Clayton, 2008; Gao et al., 2015; König et al., 2014; Özbek et al., 2018; Wang et al., 2014; Xu & Hao, 2018). Most studies have focused on power when presenting results and comparing approaches, whereas biases and the interpretability of coefficients have not been well investigated and discussed. As risk prediction methods such as polygenic risk scores (PRS) increase in popularity (Lambert et al., 2019; Lloyd-Jones et al., 2019), a focus on the properties of SNP effect estimates, including coefficient biases and concerns regarding interpretability, will become more important to integrate X chromosome variants into these predictive models.

We previously proposed using an XCI-robust approach for X chromosome genetic variant analysis, by including a sex–SNP interaction term, which allows for different effects of the genetic variants between males and females, and thus, increases the flexibility of the model. We used this approach in a study of bipolar disorder (Jons et al., 2019); but as with other methodological approaches, discussion of the utility of the model primarily focused on power to detect effects. In this study, we performed simulations to examine the impact of model assumptions and SNP coding scheme on biases of model coefficients, in scenarios with and without sex differences in SNP effect, and using models with and without SNP × Sex interaction terms in the analysis. We also illustrate our key findings through an example of an X chromosome SNP previously shown to be associated with body mass index.

2 METHODS 2.1 Common SNP coding schemes

Because of the underlying biological process of XCI, two genotype coding schemes are commonly used in the analysis of X chromosome SNP data; one corresponds to an assumption of “XCI” (formerly referred to as “Clayton”) (Clayton, 2008) while the other is consistent with escape from XCI (“eXCI,” formerly referred to as “PLINK”) (Purcell et al., 2007). Denoting the two alleles at a SNP as “a” and “A,” assuming additive allele effects, both XCI and eXCI coding schemes code “aa”, “aA” and “AA” female genotypes as 0, 1, 2. For males, XCI coding assigns genotype “a” a value of 0 and “A” a value of 2, while eXCI assigns “a” and “A” genotypes values of 0 and 1, respectively. It should be noted that the choice of coding “a” or “A” as the effect allele may impact the coefficient estimates beyond just a change in sign, particularly for eXCI coding which codes the male genotypes on a smaller scale relative to females; therefore, it is important to include sex as a covariate (Chen et al., 2019).

2.2 Simulation: Data generation

Data were simulated under a variety of scenarios including assuming a locus undergoing XCI or escaping from XCI, as well as with and without SNP × Sex interactions. The simulated variables were sex (female and male), SNP genotype (aa, Aa, or AA for females; a or A for males) and outcomes (0,1). Each simulated data set consisted of roughly 500 females and 500 males, with female taken as the reference level (sex = 0) and male sex coded as 1. The frequencies of the minor allele was set at 0.3 and 0.5, and random XCI was assumed.

After simulating the sex and genotype data, binary outcomes were randomly generated using a logistic model conditional on sex and genotypes; specifically, two sets of outcomes were generated corresponding to the assumptions of a locus undergoing or escaping from XCI (referred to as “XCI” and “eXCI”) based on XCI (Clayton) and eXCI (PLINK) coded genotypes, respectively. For data generated under both the XCI and eXCI assumptions, outcomes were generated in two ways, using models with and without SNP × Sex interaction terms (i.e., in the absence and presence of sex differences in SNP effect).

Table 1a,b shows the data-generating models considered in the absence of sex differences in SNP effects. Specifically, a binary outcome Y was generated from a model with only main effects of sex and SNP, using the equation: urn:x-wiley:07410395:media:gepi22393:gepi22393-math-0001where SNP was coded assuming XCI or eXCI. The intercept was fixed at 0, and five combinations of βsex and βSNP coefficients were used (Table 1a,b). We generated 100 data sets under each parameter combination. Table 1. Data-generating models in the absence of SNP × Sex interaction effects using (a) XCI (Clayton) coding and (b) eXCI (PLINK) coding Coding Model coefficients OR for effect of sex ORs for effect of SNP, given sex Prevalence Sex # Copies of effect allele βsex βSNP ORsex (eβsex) ORSNP|M (e2βSNP) ORSNP|W1 (eβSNP) ORSNP|W2 (e2βSNP) Overall Female Male (a) XCI (Clayton) coding 0 1 2 0 0.75 e0 e1.5 e0.75 e1.5 0.664 0.668 0.660 M 0 2 NA 0.1 0.5 e0.1 e1 e0.5 e1 0.628 0.618 0.638 F 0 1 2 0.2 0.2 e0.2 e0.4 e0.2 e0.4 0.574 0.550 0.598 0.5 0.1 e0.5 e0.2 e0.1 e0.2 0.585 0.514 0.646 0.75 0 e0.75 e0 e0 e0 0.589 0.500 0.678 (b) eXCI (PLINK) coding 0 1 2 0 0.75 e0 e0.75 e0.75 e1.5 0.629 0.668 0.590 M 0 1 NA 0.1 0.5 e0.1 e0.5 e0.5 e1 0.602 0.618 0.586 F 0 1 2 0.2 0.2 e0.2 e0.2 e0.2 e0.4 0.562 0.550 0.574 0.5 0.1 e0.5 e0.1 e0.1 e0.2 0.579 0.524 0.634 0.75 0 e0.75 e0 e0 e0 0.589 0.500 0.678 Note: With the coding scheme used when generating the data specified on the left, we give how SNPs are coded within sex under each coding scheme. The “Model” column provides the five coefficient combinations we used to generate the data in this simulation study. We then calculated the odds ratio for the effect of sex and for the effect of SNP given sex. ORsex refers to the odds ratio for the effect of sex with female as the reference level and male equal to 1. ORSNP|M refers to the odds ratio for the effect of SNP, either comparing SNP = 0 with SNP = 2 in XCI (Clayton) or with SNP = 1 in eXCI (PLINK) coding, given sex = male. ORSNP|W1 refers to the odds ratio for the effect of SNP comparing SNP = 0 with SNP = 1 and ORSNP|W2 refers to the odds ratio for the effect of SNP comparing SNP = 0 with SNP = 2, given sex = female. In the “Prevalence” column, we calculated the proportion of cases in the overall population (1000 cases), in females (500 cases), and in males (500 cases). Table 2a,b shows the data-generating models considered in the presence of sex differences in SNP effects, introduced by including a Sex × SNP interaction term when we generated the data. The general form of the equation is: urn:x-wiley:07410395:media:gepi22393:gepi22393-math-0002 Table 2. Data-generating models in the presence of SNP × Sex interaction effects using (a) XCI (Clayton) coding and (b) eXCI (PLINK) coding Model ORs for effect of sex, given SNP ORs for effect of SNP, given Sex Prevalence Sex # Copies of effect allele βsex βSNP βint ORsex|SNP0 (eβsex) ORsex|SNP2 (eβsex+2βint) ORSNP|M (e2βSNP+2βint) ORSNP|W1 (eβSNP) ORSNP|W2 (e2βSNP) Overall Female Male (a) XCI (Clayton) coding 0 1 2 0.2 0.2 0.1 e0.2 e0.4 e0.6 e0.2 e0.4 0.585 0.550 0.620 M 0 2 NA 0.2 0.2 0.2 e0.2 e0.6 e0.8 e0.2 e0.4 0.595 0.550 0.640 F 0 1 2 0.2 0.2 0.3 e0.2 e0.8 e1 e0.2 e0.4 0.604 0.550 0.658 (b) eXCI (PLINK) coding 0 1 2 0.2 0.2 0.1 e0.2 e0.3 e0.3 e0.2 e0.4 0.568 0.550 0.586 M 0 1 NA 0.2 0.2 0.2 e0.2 e0.4 e0.4 e0.2 e0.4 0.574 0.550 0.598 F 0 1 2 0.2 0.2 0.3 e0.2 e0.5 e0.5 e0.2 e0.4 0.579 0.550 0.608 Note: with the coding scheme used when generating the data specified on the left, we give how SNPs are coded within sex under each coding scheme. The “Model” column provides the three coefficient combinations we used to generate the data in this simulation study. We then calculated the odds ratio for the effect of sex given SNP and for the effect of SNP given sex. ORsex|SNP0 refers to the odds ratio for the effect of sex (with female as the reference level and male equals to 1) given SNP = 0 for both coding schemes. For XCI (Clayton), ORsex|SNP2 refers to the odds ratio for the effect of sex given SNP = 2 (with female as the reference level); for eXCI (PLINK) coding, ORsex|SNP1 refers to the odds ratio for the effect of sex given SNP = 1 (with female as the reference level). ORSNP|M refers to the odds ratio for the effect of SNP, either comparing SNP = 0 with SNP = 2 in XCI or with SNP = 1 in eXCI coding, given sex = male. ORSNP|W1 refers to the odds ratio for the effect of SNP comparing SNP = 0 with SNP = 1 and ORSNP|W2 refers to the odds ratio for the effect of SNP comparing SNP = 0 with SNP = 2, given sex = female. In the “Prevalence” column, we calculated the proportion of cases in the overall population (1000 cases), in females (500 cases), and in male (500 cases).

Again, the intercept was fixed at 0, and either XCI or eXCI coding was used for the SNP variable. Three combinations of βsex, βSNP, and βint were chosen (Table 2a,b), and 100 data sets were generated under each combination.

2.3 Simulation: Data analysis

The simulated data were analyzed in multiple ways using XCI (Clayton) or eXCI (PLINK) coding and with and without SNP × Sex interaction terms. For each generated data set, regardless of whether the outcome was generated assuming XCI or eXCI, we fit four different models (Table 3).

Table 3. Logistic regression models fit to each of the simulated data sets, using either XCI (Clayton) or eXCI (PLINK) coding for the SNP effects, and with and without SNP × Sex interaction terms 1 Logit (Y) = intercept + βsex × Sex + βSNP × SNPXCI 2 Logit (Y) = intercept + βsex × Sex + βSNP × SNPeXCI 3 Logit (Y) = intercept + βsex × Sex + βSNP × SNPXCI + βint × Sex × SNPXCI 4 Logit (Y) = intercept + βsex × Sex + βSNP × SNPeXCI + βint × Sex × SNPeXCI

We first fit models including only main effects for sex and SNP. Among them, the first model used the XCI coding for the SNP variable (Model 1), while the second model used the eXCI coding for the SNP variable (Model 2). These models allowed us to study the coefficient estimates and p-values when the coding scheme used to generate data (reflecting assumptions about XCI) is inconsistent with the coding scheme used to fit data (i.e., the genotype-phenotype model), as well as when they are consistent. We coded sex = 0 for females and sex = 1 for males, and coded the SNP genotypes in terms of the same allele as in the data generation. We stored the coefficients and p-values for each model, and then calculated the bias of the coefficients (fitted value minus true value).

We also fit additional models that included interaction terms between sex and SNP (Models 3 and 4 in Table 3). Both the main effect SNP term and the interaction term used the same SNP coding within a model. In addition to assessing bias and p-values of the individual model terms, we conducted likelihood ratio χ2-tests (LRT). First, we compared all models with the model “logit (Y) = intercept + βsex × Sex” to test the overall effect of the SNP coefficients; this would be either a one or 2-df likelihood ratio χ2-test depending on which model we are testing (e.g., df = 1 for testing SNP effects in Model 1, but df = 2 for jointly testing SNP and SNP × Sex interaction effects in Model 3). Second, for the models with interaction terms we performed a 1-df likelihood ratio χ2-test comparing to a model “logit (Y) = intercept + βsex × Sex + βSNP × SNP” (comparing Model 3 with Model 1 and comparing Model 4 with Model 2), to test for sex differences in SNP effect.

For all combinations of data generation scenarios and analysis methods, we plotted the estimated coefficients/bias and p-values using boxplots, and plotted the proportions of p-values from likelihood ratio χ2 tests below a prespecified significance threshold to display power using bar-plots.

In addition to the coefficient combinations shown in Tables 1a,b and 2a,b, we also assessed additional scenarios with lower prevalence or a negative interaction term (Supporting Information Material). In the following sections, we will focus on results with MAF = 0.5 and with coefficient combinations and prevalence shown in Tables 1a,b and 2a,b. Key findings from additional scenarios will be discussed, without showing detailed results.

2.4 Simulation: Nonrandom XCI We also considered how the presence of XCI skewness or, in other words, nonrandom XCI, impacts coefficient biases when the genetic model is misspecified because it does not account for skewness. As shown previously, nonrandom/XCI skewness is equivalent to a genetic model that includes a dominance deviation (Chen et al., 2019). Therefore to generate data consistent with XCI skewness, we added a dominance variable (D) coded as (0, 1, 0) for female genotypes (aa, Aa, AA) and (0, 0) for male genotypes (a, A). The coefficient for this non-additive effect term D was set to be less than the additive effect (SNP term coefficient) to mimic the situation of skewness. The general form of the equation is: urn:x-wiley:07410395:media:gepi22393:gepi22393-math-0003

The coefficient combinations of sex and SNP terms were the same as above (Tables 1a,b and 2a,b) and the coefficient of the dominance effect term was set at 0.1 for all scenarios. As above, we fit Models 1–4 (Table 3) without adding a nonadditive effect term and performed likelihood ratio χ2 tests. Results were summarized similar to the scenarios without dominance effects.

2.5 Application to genetic associations with obesity

We sought to examine the observed changes in coefficient estimates from the four genetic models that were fit in simulations in the context of an applied example. Previously, the minor T allele of rs1316982, an X chromosome SNP in an intron of the gene IL13RA1, was identified as significantly associated with body mass index (BMI) in a GWAS (Hoffmann et al., 2018). Using data from the UK Biobank, we re-analyzed the relationship between this SNP and obesity (defined as BMI > 30) in a sample of N = 173,557 males and N = 201,630 females of European ancestry. Genotyping, imputation, and quality control procedures have been previously described (Hoffmann et al., 2018). We fit the four logistic regression models described in Table 3. For each model, we also included additional covariates for age, assessment center, genotyping batch, and the first 20 principal components to capture population stratification. We also fit a model with sex and the additional covariates, but without a SNP or SNP × Sex interaction term.

3 RESULTS 3.1 Part 1: Data generated in the absence of sex differences in SNP effect 3.1.1 Fitting models without SNP × Sex interaction

Figure 1 displays results for data generated in the absence of sex differences in SNP effect (Table 1a,b) and analyzed using either XCI (Clayton) (Model 1) or eXCI (PLINK) coding (Model 2) with no SNP × Sex interaction term (Figure 1 and Table S1) for situations with MAF = 0.5; results for MAF = 0.3 were similar (Figures S1.1–S1.6). As expected, we observed biases in the SNP and sex coefficients when there was model misspecification (i.e., inconsistency in the coding schemes used to generate and fit the data, e.g., generate data under XCI coding, but use eXCI coding to fit, or vice versa). The biases were greatest when the true effect of SNP (i.e., the magnitude of true SNP coefficient) was greatest. The direction of the bias (positive or negative) depends on how the variables are coded, in particular the chosen reference levels. As we stated previously, we fixed allele frequencies at 0.5 and coded female as 0 and male as 1. Under such circumstances, for data sets generated using eXCI coding, we observed negative biases (fitted < true) in both SNP and sex coefficients when analyzed using XCI coding. For data sets generated using XCI coding, we observed positive biases (fitted > true) in both sex and SNP coefficients when analyzed with eXCI coding. Notably, we also observed false-positive tests of the sex effects along with the coefficient biases, while the tests for the SNP effect had no type 1 error inflation.

image

Bias and p-values of sex and SNP coefficients when generating data in the absence of sex differences in SNP effect and fitting models without SNP–sex interaction terms. Top row: Boxplots of bias (Y-axis; estimate minus true coefficient) of sex and SNP coefficients across 1000 simulation runs for various simulation settings (X-axis) when data is generated using eXCI (PLINK) coding (left) or XCI (Clayton) coding (right). Bottom row: Boxplots of p-values (Y-axis) of sex and SNP coefficients across 1000 simulation runs for various simulation settings (X-axis) when data is generated using eXCI coding (left) or XCI coding (right). Color indicates the model that was fit (Model 1 or Model 2, XCI or eXCI coding without a SNP–sex interaction term)

When we changed the prevalence to be below 50%, as opposed to above 50% shown in Tables 1a,b and 2a,b, we observed the same patterns but with a change in the direction of the biases (Figures S2.1–S2.6).

3.1.2 Fitting models with SNP × Sex interaction terms

Figure 2 displays results for data generated in the absence of sex differences in SNP effect (Table 1a,b) and analyzed using either XCI (Clayton) coding (Model 3) or eXCI (PLINK) coding (Model 4) with a SNP × Sex interaction term. When an interaction term was included in the model to fit the data, the coefficients (and corresponding p-values) for sex and SNP were unbiased regardless of whether the SNP coding was correctly specified in the model (Figure 2 and Table S2). However, we observed biases in the interaction terms if there was an inconsistency between the coding schemes used to generate the data and to fit the data, especially in the scenarios with a stronger SNP effect. The p-values for interaction terms in those models were also significantly lower than expected under the null hypothesis of no interaction, leading to false-positive interaction signals under misspecification of the SNP coding.

image

Bias and p-values of sex and SNP coefficients when generating data in the absence of sex differences in SNP effect and fitting models with SNP–sex interaction terms. Top row: Boxplots of bias (Y-axis; estimate minus true coefficient) of sex and SNP coefficients across 1000 simulation runs for various simulation settings (X-axis) when data is generated using eXCI (PLINK) coding (left) or XCI (Clayton) coding (right). Bottom row: Boxplots of p-values (Y-axis) of sex and SNP coefficients across 1000 simulation runs for various simulation settings (X-axis) when data is generated using eXCI coding (left) or XCI coding (right). Color indicates the model that was fit (Model 3 or Model 4, XCI or eXCI coding with a SNP–sex interaction term)

Although false-positive evidence for interaction terms from the 1-df test was observed when the SNP coding was misspecified, the p-values from the 2-df test (joint test of SNP and SNP–sex interaction) were the same for both coding schemes (Figure 3a), even under misspecification, and the power estimates based on a p-value threshold of 0.01 (Figure 3b) were similar as well, as expected.

image

Two degree-of-freedom F tests for SNP coefficients for data generated in the absence of sex differences and fitting models with SNP-sex interaction terms. (a) Boxplots of p-values (Y-axis) of df = 2F tests across 1000 simulation runs for various simulation settings (X-axis) when d

留言 (0)

沒有登入
gif