Inferring the Allelic Series at QTL in Multiparental Populations [Multiparental Populations]

Abstract

Multiparental populations (MPPs) are experimental populations in which the genome of every individual is a mosaic of known founder haplotypes. These populations are useful for detecting quantitative trait loci (QTL) because tests of association can leverage inferred founder haplotype descent. It is difficult, however, to determine how haplotypes at a locus group into distinct functional alleles, termed the allelic series. The allelic series is important because it provides information about the number of causal variants at a QTL and their combined effects. In this study, we introduce a fully Bayesian model selection framework for inferring the allelic series. This framework accounts for sources of uncertainty found in typical MPPs, including the number and composition of functional alleles. Our prior distribution for the allelic series is based on the Chinese restaurant process, a relative of the Dirichlet process, and we leverage its connection to the coalescent to introduce additional prior information about haplotype relatedness via a phylogenetic tree. We evaluate our approach via simulation and apply it to QTL from two MPPs: the Collaborative Cross (CC) and the Drosophila Synthetic Population Resource (DSPR). We find that, although posterior inference of the exact allelic series is often uncertain, we are able to distinguish biallelic QTL from more complex multiallelic cases. Additionally, our allele-based approach improves haplotype effect estimation when the true number of functional alleles is small. Our method, Tree-Based Inference of Multiallelism via Bayesian Regression (TIMBR), provides new insight into the genetic architecture of QTL in MPPs.

MULTIPARENTAL POPULATIONS (MPPs) are experimental populations of model organisms generated by breeding a small but genetically diverse set of inbred parents to produce individual offspring whose genomes are mosaics of the original founder haplotypes (Churchill et al. 2004; Cavanagh et al. 2008; King et al. 2012). Because these haplotype mosaics succinctly describe the genetic differences between individuals, the standard approach for interrogating the genetic basis of quantitative traits in MPPs is haplotype-based association (Mott et al. 2000; Valdar et al. 2006; Aylor et al. 2011; Collaborative Cross Consortium 2012; Huang et al. 2015; Broman et al. 2019).

The typical protocol for haplotype-based association is as follows. First, haplotypes are inferred along the genome for each individual by comparing its genotypes with those of the founders (Mott et al. 2000; Zheng et al. 2015; Broman et al. 2019). Then, quantitative trait locus (QTL) mapping proceeds by testing at each genomic locus the association of the trait with inferred founder haplotype state. For example, a linear model for the additive effect of J founder haplotypes at a genomic locus on a quantitative trait is given byEmbedded ImageEmbedded Imagewhere yi is the quantitative trait measurement of individual i, xji is the number of copies of the founder j haplotype, with Embedded ImageEmbedded Image if haplotypes are known and Embedded ImageEmbedded Image if counts are estimated as imputed dosages, βhap,j is the additive effect of the founder j haplotype, and εi is normally distributed error.

Haplotypes provide a richer source of information than observed variants (Haley and Knott 1992; Martínez and Curnow 1992). Whereas observed variant approaches such as single nucleotide polymorphism (SNP) association typically assume biallelic effects, haplotype-based association (technically a type of linkage disequilibrium mapping) tests the combined effects of all variants within the genomic interval, including any local epistastic interactions or variants that are unobserved or undiscovered (Zhang et al. 2014). This permits detection of complex genetic signals that may not be revealed by single-variant approaches, an advantage that has contributed to the widespread development of MPPs across a variety of biomedically (Churchill et al. 2004; Collaborative Cross Consortium 2012; Macdonald and Long 2007; King et al. 2012; Kover et al. 2009) and agriculturally (Huang et al. 2015) important model organisms and species.

Nonetheless, the results of haplotype-based association do not translate directly into knowledge about causal variants. This is because the standard model used for haplotype-based association assumes that all haplotypes are functionally distinct and that their effects are independent. This latter assumption is biologically unlikely: it is more reasonable to expect that there are only a few causal variants at a locus, and that combinations of these variants will often be shared across haplotypes. More specifically, we expect that sets of shared causal variants partition the haplotypes into a potentially smaller number of functionally distinct alleles, with this assignment of haplotypes to functional alleles termed the allelic series.

An allelic series is easily superimposed onto the standard haplotype model by merging haplotypes into groups (Yalcin et al. 2005; Mosedale et al. 2019). For example, a linear model in the case of J = 5 haplotypes grouped into K = 3 functional alleles might be given byEmbedded ImageEmbedded Imagewhere haplotypes are functionally identical for founders 1 and 4, and for founders 2 and 3. In this case, the J-vector of haplotype effects Embedded ImageEmbedded Image is collapsed into a K-vector of allele effects Embedded ImageEmbedded Image. This relationship is described in matrix notation by βhap = Mβalle, where M is a J × K indicator matrix that encodes the allelic series (Jannink and Wu 2003). Doing this, however, requires knowing the allelic series in advance, information that is not typically available.

Knowledge of the allelic series, and in particular, whether it is biallelic (K = 2) or multiallelic (K > 2), is critical for inference about the number of causal variants at a locus. This allelic perspective also suggests that the haplotype-based association approach is inefficient because it estimates redundant parameters when some haplotypes may be functionally equivalent (K ≤ J). Thus, an allele-based association approach would provide valuable insights into the number of causal variants while potentially improving effect estimation.

In this study, we introduce a method for QTL analysis that explicitly models an allelic series of haplotypes. Our method treats the allelic series as an unknown quantity that must be inferred from the data. In the context of the previous linear model, this means inferring the indicator matrix M while K is also unknown. This is a challenging problem because the number of possible allelic configurations is large even when the number of haplotypes is small.

There are currently no established methods for inferring the allelic series in MPPs, with QTL methods focused instead on, for example, accommodating uncertainty due to haplotype reconstruction (Mott et al. 2000; Kover et al. 2009; Durrant and Mott 2010; Zhang et al. 2014), or incorporating multiple QTL or terms for polygenic population structure (Valdar et al. 2009; Yuan et al. 2011; Gatti et al. 2014; Wei and Xu 2016). A recent study explored the relationship between the allelic series and QTL mapping power, but this was in the context of a haplotype-based association approach (Keele et al. 2019).

Inference of the allelic series in practice is often subjective, combining patterns in haplotype effect estimates with some intuition about the number of functional alleles (Aylor et al. 2011; Kelada et al. 2012). Yalcin et al. (2005) developed a method called merge analysis that compares biallelic contrasts of merged haplotypes, as suggested by SNPs present in available sequence data, with the full haplotype model to see which most parsimoniously fits the data. This was trivially extended to multiallelic variants in Mosedale et al. (2019). In our framework, those approaches treat M and K as known and assume that they are implied by a single observed variant. King et al. (2014) also generalized merge analysis to interrogate multiallelic contrasts. Their approach implies a uniform prior distribution over the allelic series, Embedded ImageEmbedded Image Their procedure, however, was ad hoc and not embedded within a broader statistical framework that could account for prior information about the allelic series.

The approach most closely resembling ours is Jannink and Wu (2003), used to infer the allelic series in doubled haploid lines. Their method places either a uniform or Poisson distribution on K, with the conditional allelic series then distributed uniformly, Embedded ImageEmbedded Image. They found that an allele-based model improves haplotype effect estimation but that inference of the allelic series itself was generally uncertain. Notably, their approach did not incorporate prior information about the relatedness of the haplotypes, which they identified as a key limitation. It is reasonable to expect that closely related haplotypes are more likely to be functionally identical than distantly related haplotypes, and, consequently, that including this information would improve allelic series inference. Accounting for haplotype relatedness in an allele-based association framework is the primary innovation of our research.

Our approach frames inference of the allelic series as a Bayesian model selection problem. As suggested above, this requires specifying a prior distribution over the space of allelic configurations, p(M). Since this space is often much larger than the number of observations, information about the allelic series will typically be low. This makes the prior distribution critical, as it provides the basis for setting expectations about the number of functional alleles and their haplotype composition.

Our prior for p(M) is based on the Chinese restaurant process (CRP), which is the distribution over partitions that underlies the popular Dirichlet process mixture model (Escobar and West 1995; Müller et al. 2015). In this framework, the haplotypes are partitioned into a potentially smaller set of functional alleles, with the alleles having independent effects. The CRP allows for control over the prior number of alleles via its concentration parameter, but it implicitly assumes equal relatedness between individual haplotypes. We generalize the CRP prior to allow for unequal relatedness between the haplotypes by leveraging a particular property, namely, that the CRP can be described as the distribution of partitions induced by functional mutations on random coalescent trees, a representation known as Ewens’s sampling formula (Ewens 1972; Kingman 2006) (example in Figure 1).

Figure 1Figure 1Figure 1

Allelic series induced by functional mutations on coalescent trees of haplotypes. Alleles are denoted by circle color (yellow, blue, green) and functional mutations are denoted by purple hashes. (A) One functional mutation on a tree partitions four haplotypes into two functional alleles: | . (B) Additional mutations on the same tree partition the haplotypes into three functional alleles: | | . The second mutation does not affect . (C) Two functional mutations on a different tree partition the haplotypes into the same allelic series: | | . Note that the allelic series from the first example, | , is impossible given this tree.

Ewens’s sampling formula provides an intuitive mechanism for introducing prior information about haplotype relatedness: assuming that the phylogenetic tree of the haplotypes is known rather than random. This defines a prior distribution over the allelic series that is informed by a tree, Embedded ImageEmbedded Image In this way, our approach is similar to other models that include phylogenetic information; for example, by modeling distributional “changepoints” on a tree (Ansari and Didelot 2016), or by using phylogenetic distance as an input for a distance-dependent CRP (Cybis et al. 2018), among others (Zhang et al. 2012; Thompson and Kubatko 2013; Behr et al. 2020; Selle et al. 2020). In particular, Ansari and Didelot (2016) specify a prior distribution over the allelic series by defining the prior probability that each branch of a tree is functionally mutated with respect to a phenotype (in their case, a categorical trait). This is also how we define Embedded ImageEmbedded Image and we highlight that this is embedded within the broader population genetics framework of Ewens’s sampling formula (Ewens 1972; Kingman 2006), with its connections to both the CRP and the coalescent (Berestycki 2009).

In the remainder, we introduce a fully Bayesian framework for inferring the allelic series and additive allele effects in MPPs. This places the allelic series on a continuum that encompasses both single-variant and haplotype-based approaches. Our approach accounts for multiple sources of uncertainty found in typical MPPs, including uncertainty due to haplotype reconstruction, the number of functional alleles (Escobar and West 1995; Müller et al. 2015), and the magnitude of their effects (Gelman 2006). We outline a strategy for posterior inference using a partially collapsed Gibbs sampler (Neal 2000; van Dyk and Park 2008; Park and Van Dyk 2009) and use posterior samples and Rao-Blackwellization to calculate the marginal likelihood (Blackwell 2007; Chib 1995), which is useful for comparing competing model assumptions (Kass and Raftery 1995). We then evaluate various properties of the allelic series approach via simulation and highlight several key findings. We conclude by presenting a series of illustrative real-data examples, from incipient lines of the Collaborative Cross (PreCC) (Kelada et al. 2012, 2014) and the Drosophila Synthetic Population Resource (DSPR) (King et al. 2014), that showcase the inferences facilitated by our allele-based approach.

Materials and MethodsOverview

At a quantitative trait locus (QTL), a trait Embedded ImageEmbedded Image measured in N individuals Embedded ImageEmbedded Image is associated with genetic variation at a particular location in the genome. In a diploid multiparental population with J ≥ 2 founder strains Embedded ImageEmbedded Image this genetic variation is encoded by the pair of founder haplotypes, or the diplotype, at the locus, denoted for each individual by the indicator vector di with length Embedded ImageEmbedded Image This length corresponds to the number of possible founder haplotype pairs, of which J are homozygous and Embedded ImageEmbedded Image are heterozygous. The diplotype states of all individuals are given by the indicator matrix Embedded ImageEmbedded Image with dimension Embedded ImageEmbedded Image We are interested in understanding the relationship between y and D.

For now, assume that the diplotype states are known, and that the phenotype is completely explained by the additive effects of the haplotypes and normally distributed individual error, i.e., there are no other covariates, replicate observations, or population structure. Under these conditions, a linear model for y and D isEmbedded ImageEmbedded Image(1)where Embedded ImageEmbedded Image is a J-vector of haplotype effects, A is a Embedded ImageEmbedded Image matrix that maps diplotypes state to haplotype frequency such that DA is an N × J design matrix of additive haplotype half-counts for each individual (i.e., the jith element is half the number of founder j haplotypes at the locus for individual i), and σ2 scales the residual variance.

A standard Bayesian analysis of this linear model assumes that the haplotype effects are a priori distributed asEmbedded ImageEmbedded Imagewhere μ is an intercept and ϕ controls the size of the haplotype effects relative to individual error (Servin and Stephens 2007). This fits an independent effect for each of the J founder haplotypes, implicitly assuming that, with respect to the phenotype, each founder haplotype is functionally distinct. This assumption, however, is rarely expected in practice. It is more realistic to assume that haplotypes group are grouped into K ≤ J functional alleles, with the assignment of haplotypes to functional alleles termed the allelic series.

Our approach extends the standard additive model to explicitly account for the allelic series, as in Jannink and Wu (2003). This decomposes the haplotype effects into the product of the allelic series matrix and a vector of allele effects:Embedded ImageEmbedded Image(2)where Embedded ImageEmbedded Image is a length K vector of allele effects, Embedded ImageEmbedded Image is a J × K matrix denoting the allelic series, and mj is a length K indicator vector denoting the allele assignment of strain j. For example, if there J = 3 haplotypes (labeled A, B, and C), and haplotypes A and C share one of K = 2 functional alleles, then the corresponding allelic series matrix isEmbedded ImageEmbedded ImageThe haplotype effects βhap are not independent and functionally distinct, but instead are comprised of repeated values of a smaller set of allele effects βalle. More generally, the allelic series matrix M partitions the J haplotypes into K functional alleles, which also determines the number of allele effects in βalle. If the allelic series is known and K < J, this approach will estimate βhap more efficiently than the standard haplotype-based approach because it fits only K allele effects, rather than J redundant haplotype effects.

The allelic series is rarely known a priori, but it may be inferred from the data. From a Bayesian perspective, we are interested in the allelic series’ posterior distribution,Embedded ImageEmbedded Imagewhich requires specifying a prior distribution over the space of possible partitions encoded by M. Specifying a prior distribution over this space involves simultaneously defining expectations about the number of functional alleles and which combinations of haplotypes are more or less likely to be functionally distinct. This is particularly challenging when there are many founder haplotypes, as the space of allelic series partitions becomes exceedingly large.

Partition problems are common in Bayesian nonparametric statistics, and our approach is closely related to the popular Dirichlet process (DP) with a normal base distribution, i.e.,Embedded ImageEmbedded Imagewhere α is the concentration parameter (Escobar and West 1995). Under the DP, the corresponding prior distributions on M and βalle areEmbedded ImageEmbedded ImageEmbedded ImageEmbedded ImageIn the CRP, the concentration parameter α controls the prior distribution of the number of functional alleles, and the distribution over particular allelic configurations is implied by the process itself. Specifically, the CRP assigns a haplotype to an allele conditionally, in proportion to the number of haplotypes already assigned to that allele, without considering which particular haplotypes comprise the allele. In this way, the CRP is uninformative with respect to the relationship between individual haplotypes. We use the CRP as a starting point for directly modeling the allelic series, eventually modifying it in order to introduce additional prior information about haplotype relatedness.

The remainder of the methods first describes the likelihood function in more detail, using sum-to-zero contrasts and conjugate prior distributions to simplify the likelihood. Then, it focuses on the CRP, using Ewens’s sampling formula to show how the CRP can be interpreted as a distribution over random coalescent trees with the haplotypes at the leaves. Next, this connection to the coalescent is used to define an informative prior distribution for the allelic series that reflects information about haplotype relatedness, as encoded by a phylogenetic tree. Prior distributions are then specified for the remaining model parameters, along with elicitation of prior hyperparameters. A graphical summary of the fully specified model is shown in Figure 2. Next, we describe posterior inference via a partially collapsed Gibbs sampler and how the output of this sampler can be used to estimate the marginal likelihood. The section ends by describing the simulations and real data examples presented in the results.

Figure 2Figure 2Figure 2

Graphical summary of the allele-based association approach. Shaded nodes are data, open nodes are variables, and double circle nodes are hyperparameters. Selected nodes are annotated to aid interpretation. Note that β is annotated as “allele effects” for brevity, but we actually depict the K − 1 independent effect vector, not βalle. We also omit the subscripts on the hyperparameters aα and bα. Arrows indicate dependencies between nodes. Dashed arrows are probabilistic dependencies, and labels denote the probability distributions linking the nodes. For distributions with multiple parameters, the star (*) indicates the parameter of the parent node, and the dot (.) is a placeholder for the other parameter. The notation Embedded ImageEmbedded Image is shorthand for the unnamed distribution Embedded ImageEmbedded Image given in the text. Solid arrows are deterministic dependencies, and labels denote the operation linking the nodes. The partial shading of the T node denotes that the tree can either be specified as data or treated as a variable with a coalescent prior distribution, which is parameterless. When T is a variable with a coalescent prior, T and b can be integrated from the model. Integration removes these nodes from the graph, leaving only the probabilistic dependency of M on α, given by the gray dotted arrow.

Likelihood function

The likelihood function defines the relationship between the phenotype and the diplotype states. Substituting Equation 2 into Equation 1 gives the likelihood functionEmbedded ImageEmbedded ImageWe make one additional substitution, reformulating the allele effects in terms of an intercept and a set of effects that are constrained to sum to zero,Embedded ImageEmbedded Imagewhere C is a K × (K − 1) matrix of orthogonal sum-to-zero contrasts (Crowley et al. 2014) and β is a K − 1 vector of independent effects with mean zero. This substitution yields the likelihoodEmbedded ImageEmbedded Imagewhich now includes the intercept. In the expression DAMCβ, note that A and C are fixed, whereas D, M, and β are variables inferred by the model.

Our approach requires evaluating the likelihood over many settings of M with varying dimension. This motivates the use of conjugate priors for the allele effects, which allow us to simplify the likelihood by integrating, or collapsing, the allele effects out of the expression (Servin and Stephens 2007). Thus, we use the conjugate normal-gamma prior distribution for the precision, intercept, and independent effects:Embedded ImageEmbedded ImageEmbedded ImageEmbedded ImageEmbedded ImageEmbedded Imagewhere κ, λ are shape and rate hyperparameters that control prior precision, and τ is a hyperparameter that controls the informativeness of the prior intercept. Marginally, each allele effect is distributed according toEmbedded ImageEmbedded Imagewhich is similar to the DP described earlier, but jointly, βalle is constrained to sum to μ.

The likelihood can be rewritten asEmbedded ImageEmbedded Imagewhere X is the N × K design matrix, Embedded ImageEmbedded Image and θT is a length K vector containing the intercept and independent effects, Embedded ImageEmbedded Image The intercept and independent effects are jointly distributed according toEmbedded ImageEmbedded Imagewhere V is a K × K diagonal matrix of the scaled prior covarianceEmbedded ImageEmbedded ImageConjugacy yields a closed form for a simplified, t-distributed likelihood function:Embedded ImageEmbedded ImageThis simplified likelihood depends only on the diplotype states, allelic series configuration, and relative variance of the allele effects—this is useful during posterior inference. It is straightforward to generalize the likelihood to account for covariates and replicate observations (Appendix A).

Prior distribution of the allelic seriesChinese restaurant process:

Specifying a prior distribution over the allelic series involves defining expectations about the number of functional alleles and likely allelic configurations. This is challenging because the space of possible allelic series is large even when the number of haplotypes is small. For example, the Collaborative Cross (CC) has J = 8 founder haplotypes and 4140 possible allelic series; the DSPR has J = 15 and over 1.3 billion possibilities (Rota 1964). Encoding specific prior intuitions about such a space is difficult. It is tempting to consider a uniform prior over the allelic series Embedded ImageEmbedded Image that allows the likelihood to drive posterior inference about the allelic series. However, in most cases, the number of observations will be much smaller than the number of possible allelic configurations, and this low-data scenario is precisely when prior information is most important. Instead of posterior inference being dominated by the likelihood, it will be subject to the properties of the uniform distribution, which include a strong prior belief in an intermediate number of functional alleles and a lack of flexibility to calibrate this belief.

Partition problems occur frequently in Bayesian nonparametric statistics, and a common and more flexible prior distribution is the CRP,Embedded ImageEmbedded Imagewith probability density functionEmbedded ImageEmbedded Imagewhere α is a concentration parameter that controls the expected number of functional alleles, and Jk is the number of haplotypes assigned to allele k (Escobar and West 1995). The CRP is widely used in partition problems because it is exchangeable, making it amenable to posterior sampling. Exchangeability means that the density function of the CRP can be factored into conditional distributions that describe the allele assignment of a particular haplotype given the allelic configuration of all the other haplotypes. It also means that this conditional density can be applied iteratively (and in any order), beginning with all haplotypes unassigned, to construct the unconditional density of Embedded ImageEmbedded Image (

留言 (0)

沒有登入
gif