Nonparametric Bayesian inference for meta-stable conformational dynamics

Computational prediction of molecular structures is a long-standing challenge in structural biology. Recently, novel frameworks for protein [1] and ribonucleic acid (RNA) structure prediction [2] have been proposed that greatly enhance prediction accuracy and model applicability compared to existing approaches. The focus of structure prediction, however, lies on static structures. To obtain deeper insights into molecular processes, it is necessary to complement such approaches with frameworks that also take into account the dynamics of molecular structure.

Both experimental and computational tools exist to study such conformational dynamics: a prime example of experimental approaches to this challenge are analyses of conformational switching of ion channels via widely adopted electrophysiological techniques, such as voltage clamping or lipid bilayer measurements [3–5]. On the other hand, ion channel switching can also be studied computationally via, e.g., molecular dynamics (MD) simulations [6]. The MD framework more generally can also be used to investigate protein and RNA folding [7, 8]. Experimentally, this can be assessed, e.g., via fluorescence quenching [9] or fluorescence resonance energy transfer measurements [10]. Since in all of these settings both experimental and computational protocols typically yield large amounts of data, their analysis is a challenge in itself.

To understand the switching dynamics of the system under study, one needs to obtain a coarse-grained description of the continuous system dynamics (e.g., voltages or atom coordinates) in terms of comparatively long-lived, meta-stable discrete states corresponding to distinct stable structural conformations [11]. These are defined by a separation of time scales between the intra- and inter-state dynamics. In the context of MD, particularly, the framework of Markov state models (MSMs) has received much attention in recent years; besides extensive methodological research [12–15], MSMs have also been successfully applied to a wide range of use cases [16–23]. The classical MSM approach approximates the continuous simulation dynamics (mathematically expressible in terms of the propagation operator) directly by projecting them to a discretized space on which then a discrete-time Markov chain (DTMC) is constructed. Importantly, the binning of continuous data into discrete states introduces correlations and makes the resulting discrete process generally non-Markovian; the raw MD data, in contrast, are inherently Markovian, as they originate from the integration of stochastic differential equations (SDEs).

To render the resulting discrete dynamics amenable to MSM analysis, the correlations need to be reduced via temporal thinning by some lag time constant [13, 24]. This results in two key parameters that need manual selection: the lag time and the state-space discretization. An explicit bound on the reconstruction error between the MSM and the true propagation operator can be derived in terms of these two parameters, showing that this error can be made arbitrarily small by either choosing a finer state-space discretization or increasing the lag time [13]. This relationship between state-space discretization and lag time results in a trade-off problem: one has to balance between (i) sufficient sampling of the discrete state-space, viz. a coarse state-space discretization, and (ii) a sufficiently long lag time to render the resulting process Markovian. To address this issue, tools such as the Chapman–Kolmogorov test have been introduced [13]. These tests, however, only consider the appropriateness of the lag time; the overall reconstruction error may still be large, resulting in an MSM that does not reproduce the long-time dynamics accurately, as detailed in [25]. Also, the lag time selection itself is acknowledged to be a major challenge in practice [26].
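For concreteness, the following minimal Python sketch (ours, not the paper's code; all names and parameter values are illustrative) shows the classical procedure: transitions of an already-discretized trajectory are counted at a chosen lag time τ and row-normalized, and a Chapman–Kolmogorov-style consistency check compares the estimate at lag 2τ with the squared estimate at lag τ.

```python
# Minimal sketch (not the paper's code): classical MSM estimation from an
# already-discretized trajectory `dtraj` at a chosen lag time `tau`.
import numpy as np

def estimate_msm(dtraj, n_states, tau):
    """Row-stochastic transition matrix from transition counts at lag tau."""
    counts = np.zeros((n_states, n_states))
    for t in range(len(dtraj) - tau):
        counts[dtraj[t], dtraj[t + tau]] += 1.0
    counts += 1e-8  # regularize empty rows
    return counts / counts.sum(axis=1, keepdims=True)

# Toy trajectory with long dwell times in two meta-stable states.
rng = np.random.default_rng(0)
dtraj = np.repeat(rng.integers(0, 2, size=50), 200)
P_tau = estimate_msm(dtraj, n_states=2, tau=10)
# Chapman-Kolmogorov-style check: P(2*tau) should be close to P(tau) @ P(tau).
P_2tau = estimate_msm(dtraj, n_states=2, tau=20)
print(np.max(np.abs(P_2tau - P_tau @ P_tau)))
```

Note how both the discretization (here, two states) and the lag time enter as manual choices; this is exactly the trade-off discussed above.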

The identification of meta-stable conformational states of the system is then carried out after data pre-processing given some state-space discretization and lag time. Typically, this is done utilizing spectral methods such as Perron-cluster cluster analysis (PCCA) [27] or PCCA+ [28]; other approaches are however also possible, see e.g., [29, 30]. In general, the result of this procedure will depend on the chosen lag time as well as the state-space discretization.

While the framework as such has seen several refinements [31, 32], including recent deep learning extensions [33, 34], these approaches share the two conceptual drawbacks detailed above: (i) the data need to be thinned with a manually specified lag time, which is merely a model artifact and deteriorates the overall temporal resolution, and (ii) the number of meta-stable states has to be identified manually.

Both problems can be addressed utilizing nonparametric hidden Markov model (HMM) frameworks developed in statistical machine learning [36, 42, 43]: on the one hand, the introduction of a latent process to an MSM abolishes the need to temporally de-correlate the discrete projection of the data via a lag time [31]; the HMM can thus be interpreted as a generalization of the classical MSM. On the other hand, nonparametric probabilistic models allow one to specify distributions on unbounded spaces, such as an infinite number of topics in topic modeling [35] or an infinite number of states in a Markov chain [36].

Nonparametric Bayesian HMMs have gained attention in recent years, both in experimental settings such as analyses of ion channel switching [37, 38] or single-particle tracking [39] and in MD studies [40, 41]. Inference of the meta-stable trajectories and the system parameters was, however, carried out using sampling techniques such as Markov chain Monte Carlo (MCMC); while yielding accurate results, these approaches do not scale well [42, 43]. Even for the relatively simple problem of one-dimensional ion channel voltage trajectories, they become computationally intractable for longer sequences or higher-dimensional systems.

To address both the conceptual shortcomings of MSMs and the computational tractability issues of conventional sampling approaches, we provide in this paper a scalable nonparametric Bayesian MSM inference framework for the analysis of conformational molecule dynamics. We emphasize that this framework is very widely applicable, including data generated, e.g., by voltage clamp experiments on ion channels as well as MD simulations, and can hence help bridge the gap between theory and experiment. We note also that in terms of modeling, the transition between ion channel experiments and MD simulations is gradual, as the measured ion current in the former can be interpreted as a one-dimensional reaction coordinate in the latter.

Our method requires manual specification neither of a lag time to re-establish Markovianity nor of the number of meta-stable conformations. We model the switching dynamics between distinct structural conformations via a nonparametric HMM by defining a latent Markov process on a countable set of states (meta-stable protein or channel conformations), of which one obtains only noisy, continuous-valued observations, such as currents or atom positions. We specify noise models that are appropriate for our use cases: in the experimental and computational settings described above, observations typically are real-valued vectors, $x \in \mathbb{R}^n$, or rotation angles, $x \in \mathbb{T}^2$. For the angular case, we furthermore propose a novel approximation enabling computational tractability.

To ensure scalability to large amounts of data, we resort to variational methods for inference: instead of drawing samples from the exact posterior distribution, we approximate the latter in a computationally efficient way by distributions of known type [44–46]. As we pursue a Bayesian approach, we treat the model parameters as random variables, specifying appropriate prior distributions and deriving their full posterior distributions conditioned on all observed data.

In the following, we will first introduce the general modeling framework and we will in particular address the issue of adequate prior distributions. Subsequently, we show how to perform scalable inference in this setting. Finally, we present our results on synthetic ground-truth data as well as real benchmark and experimental data.

2.1. Bayesian nonparametric Markov state models

We model the conformational molecule dynamics by utilizing the well-known HMM, consisting of two joint stochastic processes $(Z_t, X_t)$, where $t$ is the time index [47]. Note that we use roman upper-case letters $Z_t$, $X_t$ to refer to random variables and the corresponding lower-case letters $z_t$, $x_t$ to refer to particular realizations throughout the paper.

The distinct meta-stable conformational states are represented by the latent Markov states $Z_t \in \mathcal{Z} \subseteq \mathbb{N}$; the observed data (e.g., experimentally obtained channel voltages or simulated atom positions) are described by $X_t \in \mathbb{R}^n$. The time evolution of this joint process is given as a DTMC on the discrete state space $\mathcal{Z}$ governed by a transition probability function $\pi: \mathcal{Z} \times \mathcal{Z} \to [0,1]$. We represent this as a matrix $\Pi \in [0,1]^{|\mathcal{Z}| \times |\mathcal{Z}|}$, whose $k$th row $\pi_k := \Pi_{k\cdot}$ specifies the probabilities for transitions from state $k \in \mathcal{Z}$ to all possible states $l \in \mathcal{Z}$,

$$\Pi_{kl} = p(Z_{t+1} = l \mid Z_t = k). \tag{1}$$

The observation $X_t$ at time point $t$ depends only on the state of the latent process at the same time. This dependency is given by the observation density

$$p(x_t \mid Z_t = i) = p(x_t \mid \theta_i), \tag{2}$$

where $\{\theta_i : i = 1, \dots, |\mathcal{Z}|\}$ represents a set of generic distribution parameters, one for each state $i$. In accordance with the MSM literature, we interpret the HMM as a generalization of MSMs [31], and thus refer to this construct synonymously as a hidden MSM.

The key drawback of hidden MSMs regarding the analysis of conformational switching is that the number of meta-stable molecule conformations $|\mathcal{Z}|$ needs to be specified in advance. Typically, however, this number is unknown. Quite on the contrary, it is a key quantity of interest that is to be determined from the data. This shortcoming can be addressed by utilizing a nonparametric modeling approach, which allows for countably infinite state spaces. For any finite data set, $|\mathcal{Z}|$ will however be finite and can hence be learned from the data. In more concrete terms, we specify a model for potentially infinitely many distinct molecular conformations; in any given observed trajectory from simulations or experiments, only a finite number of these conformations will be visited, the number of which can then be identified.

To set up a nonparametric HMM, one needs to construct prior distributions for transition matrices on countably infinite state spaces and for countably infinite sets of observation parameters. This can be achieved via the hierarchical Dirichlet process (HDP) [48], giving rise to the HDP-HMM [36, 42, 43, 45]. In the following, we give mainly the relevant definitions; an extended background section on Bayesian nonparametrics is provided in the supporting information (https://stacks.iop.org/PB/19/056006/mmedia). An HDP-HMM is constructed hierarchically in a two-step fashion:

First, specify a Dirichlet process (DP), which is a stochastic process taking values in the space of (discrete) probability measures:

$$H_1 \sim \mathrm{DP}(\alpha, H_0), \tag{3}$$

with concentration parameter $\alpha > 0$ and base probability measure $H_0$ over some space $\Theta$.

A realization of $H_1$ is obtained by drawing independent and identically distributed (i.i.d.) samples $\theta_k \in \Theta$ from the base measure $H_0$,

$$\theta_k \overset{\text{i.i.d.}}{\sim} H_0, \qquad k = 1, 2, \dots, \tag{4}$$

and assigning to each $\theta_k$ a probability mass $\sigma_k$ via a stick-breaking process:

$$\beta_k \sim \mathrm{Beta}(1, \alpha), \qquad \sigma_k = \beta_k \prod_{l=1}^{k-1}(1 - \beta_l), \qquad k = 1, 2, \dots, \tag{5}$$

where $\mathrm{Beta}(r, s)$ is the beta distribution with shape parameters $r, s > 0$ [47]. Equation (5) is compactly written as $\sigma \sim \mathrm{GEM}(\alpha)$, short for the Griffiths–Engen–McCloskey process [48]. A sample $H_1 \sim \mathrm{DP}(\alpha, H_0)$ accordingly reads

$$H_1 = \sum_{k=1}^{\infty} \sigma_k\, \delta_{\theta_k}, \tag{6}$$

where $\delta_{\theta_k}$ denotes the Dirac or point measure at $\theta_k$ [49], $\delta_{\theta_k}(\theta) = 1$ if $\theta = \theta_k$ and 0 otherwise. This procedure results in a valid discrete probability measure, $\int \mathrm{d}H_1 = 1$, determining a prior distribution over conformational states: each $k$ represents one distinct conformation, with $\sigma_k$ its probability and $\theta_k$ its associated parameterization.
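The stick-breaking construction in equations (4)–(6) translates directly into code. The following Python sketch (ours, for illustration only; the standard-normal base measure $H_0$ and the truncation level K are our own assumptions) draws a truncated realization of $H_1$:

```python
# Minimal sketch of equations (4)-(6): atoms theta_k ~ H0 i.i.d. and
# weights sigma ~ GEM(alpha) via stick-breaking, truncated at K atoms.
import numpy as np

def stick_breaking(alpha, K, rng):
    beta = rng.beta(1.0, alpha, size=K)  # equation (5), Beta(1, alpha) draws
    # sigma_k = beta_k * prod_{l<k} (1 - beta_l)
    sigma = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    return sigma

rng = np.random.default_rng(1)
K = 1000
sigma = stick_breaking(alpha=5.0, K=K, rng=rng)
theta = rng.normal(0.0, 1.0, size=K)  # atoms from an illustrative normal H0
# H1 = sum_k sigma_k * delta_{theta_k}, equation (6)
print(sigma[:5], sigma.sum())  # leading weights; total mass -> 1 as K grows
```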

In the second step, $H_1$ serves as the base measure of another, subordinate DP: because $H_1$ is discrete, all samples drawn from this subordinate DP have shared support. We consider independent draws

$$G_k \sim \mathrm{DP}\!\left(\alpha + \xi,\; \frac{\alpha H_1 + \xi\, \delta_{\theta_k}}{\alpha + \xi}\right), \qquad k = 1, 2, \dots, \tag{7}$$

with the stickiness parameter $\xi \geqslant 0$, on which we will elaborate in the next paragraph. Decomposing each (discrete) draw as

$$G_k = \sum_{l=1}^{\infty} \pi_{kl}\, \delta_{\theta_l}, \tag{8}$$

with $\delta_{\theta_l}$ the point measure at index $l$, yields the weight vectors

$$\pi_k \equiv (\pi_{k1}, \pi_{k2}, \dots). \tag{9}$$

As the support of each $G_k$ is the set of shared atoms drawn in equation (4), each $\pi_k$ can be understood as a realization from a probability distribution over one row of a 'countably infinite transition matrix' $\Pi := (\pi_{kl})_{k,l \in \mathbb{N}}$.

Each element $\theta_k$ of the shared atom set $\{\theta_1, \theta_2, \dots\}$ corresponds to one latent state $k$, that is, one molecular conformation, and parameterizes the respective observation distribution $p(x_t \mid \theta_k)$ of equation (2).

In other words, the two-step HDP-HMM construction (i) defines the molecular conformations via the measure $H_1$, and (ii) determines their transition dynamics via all $\pi_k$.

The stickiness parameter introduces a self-transition bias, that is, it extends the sojourn times within each state. For ξ = 0, the classical HDP is recovered [48]. The sticky HDP-HMM has been shown to counter-balance the sensitivity of the classical HDP-HMM to within-state variability, which results in a tendency to introduce redundant states all pertaining to the same ground-truth state; see, e.g., [42]. This is exactly the setting we are interested in, as we are aiming specifically at the analysis of meta-stable states potentially exhibiting a high level of intra-state variability. Note that in comparison to classical MSMs, the stickiness parameter can be understood as a bias towards larger time scales that is to be set by the experimenter. Importantly, our approach does not discard any information, but retains all available data points; the stickiness represents only a bias, but no strict truncation of resolvable time scales. Hence, the minimum time scale this approach is able to resolve is the native time scale of the data points. As with all hyperparameters, we set ξ empirically; see the supporting information for details.
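To illustrate the effect of the stickiness parameter, the following sketch samples truncated transition-matrix rows using the common weak-limit (finite Dirichlet) approximation of the DP in equation (7); the truncation level and all parameter values are illustrative choices of ours, not the paper's settings.

```python
# Minimal sketch of the sticky subordinate DP in equation (7) via the
# weak-limit approximation: each row pi_k ~ Dirichlet(alpha * sigma + xi * e_k),
# where e_k is the kth unit vector, so xi boosts self-transitions.
import numpy as np

def sample_sticky_rows(sigma, alpha, xi, rng):
    K = len(sigma)
    Pi = np.empty((K, K))
    for k in range(K):
        conc = alpha * sigma + xi * np.eye(K)[k]  # stickiness on the diagonal
        Pi[k] = rng.dirichlet(conc)
    return Pi

rng = np.random.default_rng(2)
sigma = rng.dirichlet(np.full(10, 1.0))  # stand-in for truncated GEM weights
Pi_sticky = sample_sticky_rows(sigma, alpha=5.0, xi=50.0, rng=rng)
Pi_plain = sample_sticky_rows(sigma, alpha=5.0, xi=0.0, rng=rng)
# Mean self-transition probability is markedly larger for xi > 0.
print(np.diag(Pi_sticky).mean(), np.diag(Pi_plain).mean())
```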

The above definition allows us to formulate the full model distribution. Denoting by $x^i_{[1,T_i]} \equiv \{x^i_1, \dots, x^i_{T_i}\}$ and $z^i_{[1,T_i]} \equiv \{z^i_1, \dots, z^i_{T_i}\}$ the $i$th of a total of $I$ observed trajectories and its latent state sequence, and by $x_{[1,T]} \equiv \{x^1_{[1,T_1]}, \dots, x^I_{[1,T_I]}\}$, $z_{[1,T]} \equiv \{z^1_{[1,T_1]}, \dots, z^I_{[1,T_I]}\}$ the collections of all trajectories, we can write

$$p(x_{[1,T]}, z_{[1,T]}, \Pi, \theta, \sigma) = p(\sigma)\, p(\Pi \mid \sigma)\, p(\theta) \prod_{i=1}^{I} p(z_1^i)\, p(x_1^i \mid \theta_{z_1^i}) \prod_{t=2}^{T_i} p(z_t^i \mid z_{t-1}^i, \Pi)\, p(x_t^i \mid \theta_{z_t^i}), \tag{10}$$

with some initial distributions $p(z_1^i)$, constituting a fully Bayesian nonparametric HMM.

2.2. Observation models and conjugate priors

To fully specify the HDP-HMM, it remains to set up the observation distributions $p(x_t \mid \theta_{z_t})$ for the required spaces $x \in \mathbb{R}^n$ and $x \in \mathbb{T}^2$ as well as the corresponding prior distributions $H_0$. For the purpose of inference, it is beneficial to choose priors that are conjugate to the respective likelihoods: a prior of a specific functional form $f$ parameterized by $\gamma$, $p(\theta \mid \gamma) =: f(\theta, \gamma)$, is said to be conjugate to a given conditional probability distribution $p(x \mid \theta)$ if the resulting Bayesian posterior distribution $p(\theta \mid x) = p(x \mid \theta)p(\theta)/p(x)$ is of the same functional form as the prior, $p(\theta \mid x) = f(\theta, \gamma')$, with updated parameters $\gamma'$. This property simplifies inference, because the computation of the posterior distributions then reduces to computing the parameter updates $\gamma \to \gamma'$.
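As a minimal illustration of the update $\gamma \to \gamma'$ (deliberately simpler than the models used in this paper), consider the beta prior, which is conjugate to the Bernoulli likelihood:

```python
# Minimal conjugacy illustration (not one of the paper's models):
# a Beta(a, b) prior combined with Bernoulli observations yields a
# Beta posterior with updated parameters (a', b') in closed form.
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.7, size=100)  # observed coin flips

a, b = 2.0, 2.0                     # prior parameters gamma
a_post = a + x.sum()                # updated parameters gamma'
b_post = b + len(x) - x.sum()
print(f"posterior Beta({a_post}, {b_post}), mean {a_post/(a_post+b_post):.3f}")
```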

Real-valued data. A versatile model for coordinate data $x \in \mathbb{R}^n$, as often obtained through MD (e.g., 3D atom positions) as well as electrophysiological experiments, is the multivariate normal (MVN) distribution,

$$p(x \mid \mu, \Sigma) = \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right), \tag{11}$$

with the mean vector $\mu \in \mathbb{R}^n$ and the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$. Generally, we interpret the raw data as noisy observations of the discrete latent conformational states. The MVN is well suited for this purpose due to its unimodality, as we aim to identify well-discernible, meta-stable states. For ion channel voltage data, typically $x_t \in \mathbb{R}$, which is covered by equation (11) as the special case $n = 1$. Note that normal observation models are frequently used for biophysical experiments [37, 38, 40]. We can hence cover both MD simulation data and experimental voltage trajectory data with the same observation model.

For the MVN, a conjugate prior exists, termed the normal-inverse-Wishart (NIW) distribution, which defines a joint distribution over means $\mu$ and covariances $\Sigma$:

$$p(\mu, \Sigma) = \mathrm{NIW}(\mu, \Sigma \mid m_0, \lambda_0, \Psi_0, \nu_0) = \mathcal{N}\!\left(\mu \,\middle|\, m_0, \frac{1}{\lambda_0}\Sigma\right) \mathrm{IW}(\Sigma \mid \Psi_0, \nu_0), \tag{12}$$

with $\mathrm{IW}$ the inverse-Wishart (IW) distribution [47]. We combine the likelihood in equation (11) with the conjugate prior in equation (12) to specify the HDP-HMM for real-valued data.
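For the MVN-NIW pair, the conjugate update is available in closed form. The following sketch implements the standard textbook update equations; the prior values are illustrative, and the parameterization $(m_0, \lambda_0, \Psi_0, \nu_0)$ follows equation (12) as stated above.

```python
# Minimal sketch of the conjugate NIW update for the MVN likelihood in
# equations (11)-(12): closed-form posterior parameters given data X (N x n).
import numpy as np

def niw_posterior(X, m0, lam0, Psi0, nu0):
    N, n = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)  # scatter matrix around the sample mean
    lam = lam0 + N
    m = (lam0 * m0 + N * xbar) / lam
    nu = nu0 + N
    d = (xbar - m0).reshape(-1, 1)
    Psi = Psi0 + S + (lam0 * N / lam) * (d @ d.T)
    return m, lam, Psi, nu

rng = np.random.default_rng(4)
X = rng.multivariate_normal([1.0, -1.0], 0.1 * np.eye(2), size=500)
m, lam, Psi, nu = niw_posterior(X, m0=np.zeros(2), lam0=1.0,
                                Psi0=np.eye(2), nu0=4.0)
print(m)                   # posterior mean of mu, close to [1, -1]
print(Psi / (nu - 2 - 1))  # IW mean of Sigma (n=2), close to 0.1 * identity
```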

Angular data. Another natural and widely used way of specifying the spatial arrangement of complex molecules in MD is via the dihedral angles between adjacent atom or molecule planes [50]. Hence, data are often angular and constrained to the unit circle, $x_t \in \mathbb{S}^1$. Observation models particularly suited for these spaces are von Mises (vM) type distributions [51, 52]. Since it is customary to characterize amino acid chains such as proteins by sets of pairs of torsion angles $(\phi_i, \psi_i)$, we focus on the two-dimensional case here; the extension to multiple pairs is then straightforward. We utilize a parameterization of the bivariate vM distribution that is well known and established in the context of protein modeling [53]: for $x_t \equiv (\phi, \psi) \in \mathbb{T}^2 = \mathbb{S}^1 \times \mathbb{S}^1$,

$$p(\phi, \psi \mid \zeta, \nu, \kappa_1, \kappa_2, \kappa_3) = c\, \exp\{\kappa_1 \cos(\phi - \zeta) + \kappa_2 \cos(\psi - \nu) - \kappa_3 \cos(\phi - \zeta - \psi + \nu)\}, \tag{13}$$

where

$$c^{-1} = 4\pi^2 \left\{ I_0(\kappa_1) I_0(\kappa_2) I_0(\kappa_3) + 2 \sum_{i=1}^{\infty} I_i(\kappa_1) I_i(\kappa_2) I_i(\kappa_3) \right\}, \tag{14}$$

and $I_i$ is the modified Bessel function of the first kind and order $i$. The location parameters $\zeta$ and $\nu$ control the position of the mode of the distribution, as can be seen from the trigonometric terms in equation (13). The parameters $\kappa_1, \kappa_2, \kappa_3$ specify the spatial correlations. Note that marginalizing over $\phi$ and setting $\kappa_1 = \kappa_3 = 0$ recovers the conventional one-dimensional vM distribution,

$$p(\psi \mid \nu, \kappa_2) = \frac{1}{2\pi I_0(\kappa_2)} \exp\{\kappa_2 \cos(\psi - \nu)\}. \tag{15}$$

While analytical expressions for a conjugate prior exist also for the bivariate vM distribution [51], the infinite sum of Bessel functions in equation (14) renders the normalizer $c$ intractable in a Bayesian setting. Computing the exact vM posterior is hence not possible, as this requires integrating over all $\kappa$ parameters in equation (14). Additionally, this distribution is not guaranteed to be unimodal; intricate conditions on the relation of the concentration parameters $\kappa_1, \kappa_2, \kappa_3$ must hold to achieve unimodality [52]. For high concentration values in specific regimes, however, the bivariate vM distribution is known to be well approximated by a bivariate normal distribution [51]. This is unsurprising, as vM-type distributions and normal distributions are tightly linked in general: the former can be constructed from the latter [54]. To ensure tractability and interpretability, we exploit this circumstance and propose an approximation via

$$p(\phi, \psi \mid \theta) \approx \mathcal{N}\!\left((\phi, \psi)^{\top} \,\middle|\, \mu, \Sigma\right). \tag{16}$$

The mode position $(\zeta, \nu)$ roughly corresponds to the mean vector $\mu$; the covariance depends on the $\kappa$ parameters. The precise analytical expressions for these dependencies are rather involved and not relevant to our approximation; we hence refer the interested reader to [51, 52] for an in-depth analysis. As we focus on systems exhibiting distinct, separable meta-stable states, we expect peaked angular distributions, for which this approximation is valid. To gain a better intuition, see also figure 1, where we compare a one-dimensional vM with the corresponding normal distribution; for high concentration values, the approximation error becomes negligible. Additionally, equation (16) allows for straightforward debugging: as long as the probability mass assigned to the area outside the unit circle is small, the approximation can be assumed valid; vice versa, it deteriorates if this mass becomes non-negligible. We consequently accept this error in the observation model to arrive at a tractable expression. Note that the gained tractability may greatly aid the practical utility of the framework: otherwise, it is also customary to resort to 3D coordinates to avoid mathematical complexity, disregarding crucial structural information about the biological problem [40]. In the following, we refer to our approximation as the approximate vM model.

Figure 1. Comparison of a one-dimensional vM with the approximate vM model. Left: inferred approximate vM (red dashed line) for N = 1000 data points generated from a low-concentration vM (κ = 0.3, blue line). Gray shaded area: unit circle. Red shaded area: probability mass outside the unit circle boundaries. Right: inferred approximate vM for N = 1000 data points generated from a high-concentration vM (κ = 10). No significant probability mass is placed outside the unit circle.
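The mass-outside-the-circle check is easy to implement. The following one-dimensional sketch (ours; fitting the normal approximation by simple moment matching is our own assumption) reproduces the qualitative behavior shown in figure 1:

```python
# Minimal sketch of the normal approximation to the vM model and the
# mass-outside-the-circle check described above (one-dimensional case).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
for kappa in (0.3, 10.0):
    x = rng.vonmises(mu=0.0, kappa=kappa, size=1000)  # angles in (-pi, pi]
    mu_hat, sd_hat = x.mean(), x.std()                # moment-matched normal
    outside = 1.0 - (norm.cdf(np.pi, mu_hat, sd_hat) -
                     norm.cdf(-np.pi, mu_hat, sd_hat))
    print(f"kappa={kappa}: mass outside the unit circle = {outside:.4f}")
# Low concentration -> non-negligible outside mass (approximation invalid);
# high concentration -> outside mass ~ 0 (approximation valid).
```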

2.3. Scalable inference of meta-stable states

The goal of Bayesian inference is to compute the posterior distribution over the latent sequences $z_{[1,T]}$ and the model parameters, given the observed trajectories $x_{[1,T]}$,

$$p(z_{[1,T]}, \Pi, \theta, \sigma \mid x_{[1,T]}) = \frac{p(x_{[1,T]}, z_{[1,T]}, \Pi, \theta, \sigma)}{p(x_{[1,T]})}. \tag{17}$$

This distribution cannot be evaluated analytically. In principle, one can employ standard sampling techniques such as MCMC and obtain the posterior empirically [36]. In our case, however, the typically large data sets from simulations or long-duration experiments (see, e.g., the discussion in [37]) render this computationally infeasible, because every draw from the posterior requires one full pass through the data.

To alleviate these computational issues, we utilize a variational inference (VI) approach. The core idea of VI methods is to cast the inference task as an optimization problem. We briefly lay out the general method, but refer the interested reader to, e.g., [44]. The objective is to find an approximate (or variational) distribution $q^*$ that minimizes the Kullback–Leibler (KL) divergence to the exact posterior $p$ of equation (17):

$$q^* = \underset{q}{\arg\min}\; \mathrm{KL}\!\left(q(z_{[1,T]}, \Pi, \theta, \sigma)\, \|\, p(z_{[1,T]}, \Pi, \theta, \sigma \mid x_{[1,T]})\right). \tag{18}$$

This KL divergence is still computationally intractable because of the evidence $p(x_{[1,T]})$ in the denominator of equation (17). The evidence is in principle obtained by integrating over all unobserved model components, $p(x_{[1,T]}) = \sum_{z_{[1,T]}} \int p(x_{[1,T]}, z_{[1,T]}, \Pi, \theta, \sigma)\, \mathrm{d}\theta\, \mathrm{d}\Pi\, \mathrm{d}\sigma$, which requires a summation of $|\mathcal{Z}|^T$ terms, each of which contains the full integrals over the parameters $\Pi$, $\theta$ and $\sigma$; it is hence impossible to compute for realistic state space sizes and sequence lengths.

The optimization problem in equation (18) can, however, be transformed into an equivalent but tractable problem. To do so, one rewrites the KL divergence as

$$\mathrm{KL}(q \| p) = \log p(x_{[1,T]}) - \mathcal{L}(q), \tag{19}$$

where

$$\mathcal{L}(q) = \mathbb{E}_q\!\left[\log p(x_{[1,T]}, z_{[1,T]}, \Pi, \theta, \sigma)\right] - \mathbb{E}_q\!\left[\log q(z_{[1,T]}, \Pi, \theta, \sigma)\right], \tag{20}$$

and $\mathbb{E}_q$ is the expectation operator, where the expectation is taken with respect to $q$. This yields a lower bound on the log evidence, $\mathcal{L} \leqslant \log p(x_{[1,T]})$, since the KL divergence satisfies $\mathrm{KL}(q \| p) \geqslant 0$ for any two distributions $q, p$. Accordingly, the quantity $\mathcal{L}$ is termed the evidence lower bound (ELBO). Since the log evidence is constant with respect to the variational distribution $q$, minimizing the KL divergence is equivalent to maximizing the ELBO. The intractable computation of the log evidence is hence not needed to evaluate $\mathcal{L}$.
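The decomposition in equations (19) and (20) can be verified numerically on a toy model; the following sketch (a two-state model of our own choosing, unrelated to the paper's models) confirms that log evidence = ELBO + KL for an arbitrary q:

```python
# Minimal numeric check of equations (19)-(20) on a toy two-state model:
# log p(x) = ELBO(q) + KL(q || posterior) for any variational q.
import numpy as np

p_z = np.array([0.3, 0.7])          # prior p(z)
p_x_given_z = np.array([0.9, 0.2])  # likelihood p(x=1 | z), observed x=1
joint = p_z * p_x_given_z           # p(x=1, z)
evidence = joint.sum()              # p(x=1)
posterior = joint / evidence        # p(z | x=1)

q = np.array([0.5, 0.5])            # arbitrary variational distribution
elbo = np.sum(q * (np.log(joint) - np.log(q)))        # equation (20)
kl = np.sum(q * (np.log(q) - np.log(posterior)))      # KL(q || posterior)
print(np.log(evidence), elbo + kl)  # identical up to float precision
```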

Without further assumptions, maximization of equation (20) yields the exact posterior, $q^* = p(z_{[1,T]}, \Pi, \theta, \sigma \mid x_{[1,T]})$, but does not provide a practical way of actually performing the optimization. To enable a practical computational scheme, we employ a standard mean-field assumption on the variational distribution [44]:

$$q(z_{[1,T]}, \Pi, \theta, \sigma) = q(z_{[1,T]})\, q(\sigma) \prod_{k} q(\pi_k)\, q(\theta_k). \tag{21}$$

This enables a computationally tractable, iterative coordinate-wise ascent optimization procedure [47]: one variational factor of equation (21) is optimized at a time while keeping all others fixed, and one pass through all variational factors constitutes a VI iteration. The generic distribution update for any quantity $\alpha \in \{z_{[1,T]}, \{\pi_k\}_k, \{\theta_k\}_k, \sigma\}$ is obtained as

$$\log q^*(\alpha) = \mathbb{E}_{q(\cdot\, \setminus\, \alpha)}\!\left[\log p(x_{[1,T]}, z_{[1,T]}, \Pi, \theta, \sigma)\right] + \text{const}, \tag{22}$$

where $\mathbb{E}_{q(\cdot\, \setminus\, \alpha)}$ denotes the expectation with respect to all variational factors except $q(\alpha)$.

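To illustrate the coordinate-wise scheme on a minimal example, the following sketch runs mean-field VI for a toy univariate normal model with conjugate priors on mean and precision, a classical textbook setting; the model, hyperparameters and variable names are illustrative, not the paper's.

```python
# Minimal sketch of coordinate-wise ascent VI: x_i ~ N(mu, 1/tau) with priors
# mu | tau ~ N(mu0, 1/(lam0 * tau)) and tau ~ Gamma(a0, b0), and mean-field
# factorization q(mu) q(tau). Each loop iteration updates one factor at a
# time while holding the other fixed, as described in the text.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(2.0, 0.5, size=200)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0  # prior hyperparameters (illustrative)
E_tau = a0 / b0                          # initialization of E_q[tau]
for it in range(50):                     # VI iterations
    # Update q(mu) = N(mu_N, 1/lam_N), given the current E_q[tau].
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N), given the current q(mu).
    a_N = a0 + (N + 1) / 2.0
    E_sq = np.sum((x - mu_N) ** 2) + N / lam_N  # E_q[sum_i (x_i - mu)^2]
    b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    E_tau = a_N / b_N
print(mu_N, 1.0 / np.sqrt(E_tau))  # approx posterior mean ~2.0, noise ~0.5
```

The same alternating structure carries over to the HDP-HMM updates, with the factors of equation (21) taking the roles of q(mu) and q(tau).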