A Model of Indel Evolution by Finite-State, Continuous-Time Machines [Population and Evolutionary Genetics]

Abstract

We introduce a systematic method of approximating finite-time transition probabilities for continuous-time insertion-deletion models on sequences. The method uses automata theory to describe the action of an infinitesimal evolutionary generator on a probability distribution over alignments, where both the generator and the alignment distribution can be represented by pair hidden Markov models (HMMs). In general, combining HMMs in this way induces a multiplication of their state spaces; to control this, we introduce a coarse-graining operation to keep the state space at a constant size. This leads naturally to ordinary differential equations for the evolution of the transition probabilities of the approximating pair HMM. The TKF91 model emerges as an exact solution to these equations for the special case of single-residue indels. For the more general case of multiple-residue indels, the equations can be solved by numerical integration. Using simulated data, we show that the resulting distribution over alignments, when compared to previous approximations, is a better fit over a broader range of parameters. We also propose a related approach to develop differential equations for sufficient statistics to estimate the underlying instantaneous indel rates by expectation maximization. Our code and data are available at https://github.com/ihh/trajectory-likelihood.

IN molecular evolution, the equations of motion describe continuous-time Markov processes on discrete nucleotide or amino acid sequences. For substitution processes, these equations are reasonably well understood, but insertions and deletions (indels) have proved less tractable.

This paper presents a new approach to analysis of indel processes. In this introduction, we first discuss core bioinformatics concepts such as alignments, define a continuous-time Markov process for indels, and review previously published approximations to the finite-time alignment distributions of this process, using hidden Markov models (HMMs). In the remaining sections we describe our new method (in the Materials and Methods), report on a simulation-based evaluation (in the Results), and discuss the implications of our results (in the Discussion).

Alignments as Summaries of Indel Histories

Our motivating goal is to calculate probabilities of sequence alignments, assuming an underlying instantaneous rate model of indel events. We will mostly consider alignments of two sequences that we will refer to as the “ancestor” and the “descendant,” where the likelihood function takes the form Embedded ImageEmbedded Image, where Θ represents model parameters (e.g., mutation rates) and t is a time parameter. Common uses of this likelihood function include performing sequence alignment (for downstream inference based on homology, such as protein structure prediction), finding maximum-likelihood estimates of the time parameter t (for example, as part of phylogenetic inference of ancestral relationships), and comparing different models or parameterizations Θ (for example, to measure the rate of evolution in sequences, or to annotate conserved regions).

We seek to derive this pairwise alignment likelihood directly from an instantaneous model of sequence mutation; that is, a continuous-time Markov chain whose state space is the set of all possible DNA or protein sequences. For practical purposes, we often need to summarize paths through this process, and it is worth distinguishing between different ways of doing so. We will use three progressively detailed descriptions of the evolutionary path which we refer to as alignments, trajectories, and histories (Figure 1), as described below.

Figure 1Figure 1Figure 1

Three views of evolutionary processes—alignments, trajectories, and histories—represent different levels of summarization. Alignments include no information about intermediate events except the positions of homologous residues; trajectories include intermediate sequences and the transition events between them, but not the times at which those events occurred; histories include transition events and times. Panels are illustrated using examples from the HOMSTRAD database (Mizuguchi et al. 1998); PDB identifiers are shown. (A) Part of an alignment of two proteins from PDB. (B) A single-event trajectory consistent with alignment A. (C) Two different two-event trajectories consistent with A. (D) A three-event trajectory consistent with A. (E) A history consistent with trajectory (B), in which the single event occurs between times u and u + du. (F) Two alternate alignments of the sequences in alignment A. (G) Several equivalent alignments containing adjacent insertions and deletions that have been rearranged in different ways. (H) An alignment that can only be explained by a model incorporating both substitution and indel events. (I) The gap profile of alignment H, and its associated M/I/D column types. (J) A multiple alignment whose misaligned gap boundaries do not seem to support the TKF92 model’s assumption that multiresidue gaps arise from indivisible sequence fragments.

Alignments

A pairwise alignment consists of the observed initial and final state of the process (the ancestral and descendant sequences), with gap characters to show which residues are descended from which. An example alignment is shown in Figure 1A. Most of our discussion will be at this level of summarization.

Trajectories

A trajectory includes all the intermediate sequences from ancestor to descendant. Transitions between the intermediate sequences correspond to instantaneous changes. A trajectory uniquely implies an alignment, but there are many trajectories consistent with each alignment: the most plausible trajectory for alignment 1A is shown in 1B, but the longer trajectories in 1C and 1D are also consistent. We will refer to this level of summarization when discussing some previous methods for indel analysis.

Histories

A history consists of a trajectory fully annotated with the time of each indel event. This is the most detailed description, being a complete specification of the path of the stochastic process. For each nontrivial trajectory, there is a continuum of possible histories. For example, history 1E is consistent with trajectory 1B, with an event time u in the range 0 ≤ u ≤ t. We will not refer much to this level as it contains more information than we usually care about.

Reflecting this hierarchy of summarization, we can writeEmbedded ImageEmbedded Imagewhere (u1, u2,u3…) represents all the event times in a history.

Note that the top-level summation is over alignments. Many scenarios demand that we marginalize ambiguous or uncertain alignments. For example, the alignments in 1F are plausible alternatives to 1A; in 1G, the ordering of insertions and deletions may be considered irrelevant for many purposes; and the placement of the second gap in 1H admits some uncertainty.

If the alignment likelihood can be represented as a path probability through a pair HMM, F, then we can perform this sum over alignments using the forward algorithm (Durbin et al. 1998), writing the result asEmbedded ImageEmbedded ImageThis paper focuses on the probability distribution of alignment gaps. In general, when we refer to a gap, we will mean a run of adjacent indels in any order, as in 1G. Because of the possibility of overlapping indel events, as in 1C, these gaps can arise in a number of different ways.

The general geometric indel model

Our starting point for defining an evolutionary process is the point substitution model, applied to a sequence. In such a model, each residue evolves according to a substitution rate matrix R, such as Kimura’s two-parameter model for DNA (Kimura 1980) or Dayhoff’s PAM model for proteins (Dayhoff et al. 1978).

We generalize this by allowing instantaneous insertion and deletion events as well as point substitution events. We do not want to be forced always to count the insertion of multiple adjacent residues as separate events (as in 1D), since this leads to inferential artifacts such as trajectories with too many events, alignments with scattered gaps, rates that are too fast, or times that are too long. Consequently, our model should allow events that insert or delete multiple residues instantaneously (as in 1B), with the indel length being a random variable.

For simplicity, we want to keep the number of parameters minimal, so we specify only the mean lengths of insertion and deletion events. The maximum entropy distribution for this parameterization is the geometric distribution. Thus, the probability that a given event involves n residues is Embedded ImageEmbedded Image for an insertion, and Embedded ImageEmbedded Image for a deletion, with mean lengths Embedded ImageEmbedded Image and Embedded ImageEmbedded Image. If the rate of insertions is λ and the rate of deletions is μ, then, at any given site, an event that inserts n residues has rate Embedded ImageEmbedded Image, where Embedded ImageEmbedded Image represents the actual n residues that were inserted, while an event that deletes n residues has rate Embedded ImageEmbedded Image. So, for example, the instantaneous event EALGVK→EALGKLGVK in the history shown in Figure 1E, which occurs during the time interval [u, u + du) and inserts the three residues GKL, has instantaneous rate Embedded ImageEmbedded Image and infinitesimal probability Embedded ImageEmbedded Image. The inserted residues are independently drawn from the stationary distribution of the substitution model, so Embedded ImageEmbedded Image where ρR = 0. Thus, Embedded ImageEmbedded Image. By contrast, deletion rates are completely independent of the residues being deleted.

To summarize, the parameters of our indel model are Embedded ImageEmbedded Image consisting of indel rates (λ, μ) and indel length parameters (x, y), together with a substitution rate matrix R. We call this model the general geometric indel (GGI) model, following De Maio (2020). The GGI model is the simplest continuous-time Markov chain over sequences that is local, allows multiresidue indels, and does not enforce reversibility. We may contrast the locality with the Poisson indel process, where the indel rate per site varies inversely with the total sequence length (Bouchard-Côté and Jordan 2013). As for multiresidue indels, other models such as TKF92 do allow this, but they do so by introducing unobservable auxiliary information into the state space; specifically, TKF92 introduces fragment boundaries. We can further constrain the parameters in various ways if desired; for example, by insisting that the model be reversible Embedded ImageEmbedded Image, as with the long indel model of Miklós et al. (2004); or by requiring perfect symmetry between insertions and deletions (λ = μ and x = y), as in the simulations of De Maio (2020); or by restricting indels to single residues (x = y = 0), so all trajectories look like Figure 1D, as with the TKF91 model of Thorne et al. (1991).

Derivation of alignment likelihoods from indel processes

In the previous section, we described the GGI model with instantaneous rates (λ, μ) and extension probabilities (x,y). We now review previous approaches to calculating alignment gap likelihoods under this model and related models. These methods include the pair HMMs we evaluate in this paper: TKF91 (Thorne et al. 1991), TKF92 (Thorne et al. 1992), MLH04 (Miklós et al. 2004), LG05 (Löytynoja and Goldman 2005), RS07 (Redelings and Suchard 2007), LAHP19 (Levy Karin et al. 2019), and DM20 (De Maio 2020).

All of these methods exploit the property of the GGI model that the indel and substitution processes are independent of one another. A pairwise alignment (1H) has a gap profile (1I) that is like a residue-masked silhouette of the alignment, comprising three types of column: matches (M), in which ancestral and descendant residues are aligned, and insertions (I) and deletions (D), in which either ancestor or descendant contains a gap. We can factorize the alignment likelihood into a term for the gap profile (written as a sequence of M’s, I’s, and D’s) and a conditionally independent set of terms for the actual residue content:Embedded ImageEmbedded ImageHere, MXY is the probability that a descendant residue is Y, conditional on the ancestor being X; while ρY is the probability that an inserted residue is Y. Since deletion events are residue-blind in the GGI model, and we have already conditioned on the ancestral sequence, we do not to include terms for the probability that the two deleted ancestral residues are N and K; the gap profile tells us what positions the deleted residues were at, and that is enough.

This decomposition of indel and substitution probabilities is naturally expressed in terms of a pair HMM with M, I, and D states. We can think of the gap profile term as the probability of the state path through the HMM, while the substitution terms correspond to the emission probabilities from those states. The emission part is well understood (Thorne et al. 1991): M and ρ can be linked to an underlying point substitution rate matrix R [by the matrix exponential M = exp(Rt) and the stationary distribution ρR = 0]. Our focus is on the likelihood of the gap profile: we seek a similar relationship Embedded ImageEmbedded Image between the transition probabilities of the pair HMM, Embedded ImageEmbedded Image, and the GGI model’s rate matrix over sequences, ℝ.

We now review previous work in this area.

TKF91:

The first approach, TKF91, addresses a restricted version of the GGI model allowing only single-residue indels. This reduces to a linear birth-death process, which can be solved exactly (Thorne et al. 1991). The probability distribution over alignments can be represented as a pair HMM (Holmes and Bruno 2001). Being exactly solvable, TKF91 has become the canonical example of an indel model. However, as noted previously, it leads to systematic biases during inference, imputing trajectories with too many events, as in Figure 1D.

TKF92:

Attempting to address the deficiencies of TKF91, the TKF92 model (Thorne et al. 1992) posits a similar birth-death process, but on indivisible multiresidue fragments instead of single residues. Each fragment contains a random, geometrically distributed number of residues. TKF92 has a closed-form pair HMM solution, rather like TKF91, but with the introduction of a new parameter corresponding to the mean fragment length. However, TKF92 is also somewhat unrealistic in practice. The idea that multiresidue gaps arise from unbreakable fragments is artificial, as can be illustrated with reference to the multiple sequence alignment of Figure 1J. The gap boundaries in (1J) do not align, and this is not uncommon: empirically, there is no evidence that TKF92’s indivisible fragments are real. In practice, when using TKF92, it is common to simply marginalize over the fragment boundaries, effectively treating TKF92 as an ad hoc approximation to the GGI model. Therefore, by defining a suitable mapping between TKF92’s fragment parameters and the GGI model’s indels, we can evaluate it on this basis, as an approximation to GGI.

LG05 and RS07:

Similarly, the LG05 pair HMM used in the PRANK program (Löytynoja and Goldman 2005) and the RS07 pair HMM used in BAliPhy (Redelings and Suchard 2007) introduce fragment length parameters that can be related (with some hand-waving) to the indel length parameters of GGI. In this paper, we evaluate TKF91, TKF92, LG05, and RS07 as approximations to GGI, but we do not evaluate some other indel models that are a little harder to reconcile with GGI because of extra parameters (Rivas and Eddy 2015) or incompatible assumptions (Bouchard-Côté and Jordan 2013).

MLH04:

The MLH04 approach, developed by Miklós et al. (2004), computes lower-bound likelihoods for alignment gaps by considering short trajectories like those in Figure 1, B–D, integrating out the event times from the corresponding histories to find a likelihood for each such trajectory. To calculate the likelihood of an alignment gap, MLH does a brute-force exhaustive enumeration of all consistent trajectories, up to a given number of indel events and a maximum gap length. Under the assumption of an infinite sequence, the resulting distribution is technically still a pair HMM, albeit one with an infinite number of states (corresponding to every possible size of gap). As our simulations in the Results section demonstrate, this approach is extremely slow, and effectively impossible for trajectories with more than three overlapping indel events; however, for very short evolutionary times, MLH04 remains the most accurate approximation to GGI, short of direct simulation.

In the special cases of TKF91 and TKF92, the alignment gap lengths are geometrically distributed. This is not necessarily true in general for the GGI model (Rivas and Eddy 2015): alignment gap lengths are not geometrically distributed even though the underlying indel event lengths are. Thus, a simple three-state pair HMM—whose gap lengths are geometrically distributed—cannot be an exact solution to GGI. Nevertheless, MLH04 shows that the exact solution is, in fact, an infinite-state pair HMM, so a smaller pair HMM may be a reasonable approximation.

LAHP19:

A purely simulation-based approach to estimating the gap probabilities of the GGI model has recently been described (Levy Karin et al. 2019). In the limit of an infinite number of random trials, this approach is exact. We use such simulations as a gold standard to evaluate other approximations. However, the number of trials required to sample rare outcomes (i.e., long gaps, particularly those involving multiple-event trajectories) is large, and the simulations become computationally expensive with longer sequences. The performance and sampling limitations of this approach are further discussed in the Results.

DM20:

The DM20 method is a recent breakthrough in approximating the GGI model (De Maio 2020). Starting from the assumption that the alignment likelihood can be approximated by a product of geometric distributions over insertion and deletion lengths, De Maio derived ordinary differential equations (ODEs) for the evolution of the mean lengths of these distributions, yielding transition probabilities for the pair HMM. DM20 is a more accurate approximation to the multiresidue indel process than all previous attempts, although it has limitations: it does not allow deletions to directly follow insertions in the alignment (thus limiting its ability to model covariation between insertion and deletion lengths), it is inexact for the special case of the TKF91 model, and it requires laborious manual derivation of the underlying ODEs.

H20:

The H20 method, developed in this paper, builds on DM20 to develop a systematic differential calculus for finding HMM-based approximate solutions of continuous-time Markov processes on strings that are “local” in the sense that the infinitesimal generator is a pair HMM. Our approach addresses the limitations of DM20, identified in the previous paragraph. It does allow deletions to follow insertions, so as to better account for covariation between insertion and deletion gap sizes. The TKF91 model emerges as a special case: the closed-form solutions to TKF91 are also exact solutions to our model. Finally, although our equations can be derived without computational assistance, the analysis is greatly simplified by the use of symbolic algebra packages, both for the manipulation of equations, for which we used Mathematica (Wolfram Research, Inc.) (version 2020), and for the manipulation of state machines, for which we used our recently published software Machine Boss (Silvestre-Ryan et al. 2020).

The central idea of our approach is that the application of the infinitesimal generator to the approximating HMM generates a more complicated HMM that, by a suitable coarse-graining operation, can be mapped back to the simpler structure of the approximating HMM. By matching the expected transition usages of these HMMs, we derive ODEs for the transition probabilities of the approximator. Our approach is justified by improved results in simulations, yielding greater accuracy and generality than all previous approaches to this problem, including DM20 (which can be seen as a restricted version of our method). Our approach is further justified by the emergence of the TKF91 model as an exact special case, without the need to introduce any additional latent variables such as fragment boundaries.

While we focus here on the multiresidue indel process, the generality of the infinitesimal automata suggests that other local evolutionary models, such as those allowing neighbor-dependent substitution and indel rates, might also be productively analyzed using this approach.

The sequence rate matrix and the infinitesimal-time machine:

We now give a concise preview of the approach described in detail in the Materials and Methods.

The rate matrix ℝ of the GGI model is, for two sequences Embedded ImageEmbedded Image:Embedded ImageEmbedded Imagewhere Ω is the residue alphabet (e.g., nucleotides or amino acids), Ω* is the set of all sequences over that alphabet (including the empty sequence ε), ΩN is the set of all sequences of finite length N, B is the deleted sequence, C is the inserted sequence, and Embedded ImageEmbedded Image are flanking sequences (we will mostly be considering the infinite-sequence approximation, where Embedded ImageEmbedded Image).

Suppose that Embedded ImageEmbedded Image is a sequence evolving under the GGI model. Consider Embedded ImageEmbedded Image, the pair HMM defined in Figure 2B and Table 1. Assuming ψ (t) is infinitely long, the forward algorithm for Embedded ImageEmbedded Image computes the conditional distribution over an instant of evolutionary time:Embedded ImageEmbedded Imagewhere I is the identity matrix over sequences (Embedded ImageEmbedded Image if X = Y, 0 if X ≠ Y).

Figure 2Figure 2Figure 2

Three state machines for modeling indels in alignments. Match states (σM) are orange, insert states (σI) are green, delete states (σD) are red, and null states (σN) are uncolored. Transitions are colored by destination state. States for the machines in A–C are further described in Tables 1, 3, and 4. Our approach is to approximate C with a machine of the same form as A. (A) Machine Embedded ImageEmbedded Image, defined under Three-state HMM and in Table 3, models alignments at divergence time t. (B) Machine Embedded ImageEmbedded Image, defined under Infinitesimal-time machine and in Table 1, models the infinitesimal evolution over time Δt. (C) Machine Embedded ImageEmbedded Image, defined under Rate of change of expected transition counts and in Table 4, models alignments at divergence time t + Δt. It is the machine product of Embedded ImageEmbedded Image and Embedded ImageEmbedded Image: each state has the form XY where X is an F state and Y is a G state. Uppercase is used to indicate that a component machine makes a transition when the compound state is entered. So, for example, when Embedded ImageEmbedded Image makes the transition mI → MM, the transition weight is the product of a (for F’s self-looping M → M transition) and 1 − x (for G’s I → M transition). However, if then makes the transition MM → Dm, the transition weight is just c (for F’s M → D transition), since G stays in the M state without making a transition. This structure arises from simple rules for transition synchronization in multiplied machines (Westesson et al. 2011, 2012).

Table 1 Interpretation of states in machine Embedded ImageEmbedded Image (Figure 2B, defined in Infinitesimal-time machine); here, Embedded ImageEmbedded Image represent input and output tokens from the residue alphabet

Our approach is to find a pair HMM, Embedded ImageEmbedded Image (Figure 2A), that approximates the matrix exponential Embedded ImageEmbedded Image, by mapping the machine product Embedded ImageEmbedded Image (Figure 2C) back onto Embedded ImageEmbedded Image. We match expected transition counts between classes of states in Embedded ImageEmbedded Image to their representative transitions in Embedded ImageEmbedded Image, and take the limit Δt → 0 to derive differential equations for the transition weights of Embedded ImageEmbedded Image.

A note on finite sequences: to model these we can set Embedded ImageEmbedded Image for Embedded ImageEmbedded Image, dropping the 1 − y term for deletions that remove the end of the sequence. This ensures the total rightward deletion rate starting at any residue is μ, regardless of its distance from the end. We can then define Embedded ImageEmbedded Image as in Figure 3A. Imposing reversibility on finite-sequence models takes slightly more care (Miklós et al. 2004).

留言 (0)

沒有登入
gif