SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

Single-cell isolation, whole-genome amplification and sequencing

We isolated EpCAM+ cells from one normal and three tumoural regions (TP: tumour proximal; TC: tumour central; TD: tumour distal) from the patient with a BD FACSAria III cytometer. We successfully amplified the genomes of 28 tumour cells and 18 normal cells with Ampli1 (Silicon Biosystems) and built whole-genome sequencing libraries using the KAPA (Kapa Biosystems) library kit. Each library was sequenced at $\approx$6× on an Illumina Novaseq 6000 at the Spanish National Center of Genomic Analysis (CNAG-CR; https://www.cnag.crg.eu/). We called this dataset CRC28.

Data preprocessing

For the public TNBC16 [42] and CRC48 [43] datasets, we downloaded the raw sequencing reads from the SRA database in FASTQ format. For the three datasets (CRC28, TNBC16 and CRC48) we trimmed the Illumina adapter sequences using cutadapt (version 1.18) and mapped reads to the 1000G Reference Genome hs37d5 using BWA MEM (version 0.7.17). After de-duplication with Picard (version 2.18.14), we used GATK (version 3.7.0) for local realignment based on indel calls from the 1000G Phase 1 and the Mills and 1000G gold standard. Subsequently, we recalibrated the base scores using GATK (version 4.0.10) with polymorphisms from dbSNP (build 138) and indels from the 1000G Phase 1. Exact commands used to run the tools are featured in Supplementary Note.

SIEVE model

SIEVE is a statistical approach which combines a statistical phylogenetic model with a probabilistic model of raw read counts. We implement SIEVE under BEAST 2 [34], a popular Bayesian phylogenetic framework that uses Markov Chain Monte Carlo (MCMC) for the estimation of phylogenetic trees and model parameters.

Input data

SIEVE takes as input raw read counts of all four nucleotides at candidate SNV sites (Fig. 1a). Specifically, for cell $j \in \$ at candidate SNV site $i \in \$, the input data to SIEVE is in the form of $\mathcal _^ = (\boldsymbol_}, c_)$, where $\boldsymbol_} = \ \,|\, k = 1, 2, 3\}$ corresponds to the read counts of three alternative nucleotides with values in descending order and $c_$ to the sequencing coverage for cell j and site i. Candidate SNV sites are defined as statistically significant SNVs that could potentially occur in single cells (see the “Candidate site identification” section).

For scWGS and scWES datasets, raw read counts from $I^\prime$ background sites are denoted $\mathcal ^$. The number of background sites is used to correct acquisition bias (see the “SIEVE likelihood” section). For datasets lacking background information (for instance, from targeted sequencing), SIEVE accepts a user-specified number of background sites only for acquisition bias correction.

Candidate site identification

To identify candidate variant sites, we employ a strategy similar to SCIPhI [15]. Specifically, a likelihood ratio test is conducted for SNV detection, but with a modification enabling to capture sites containing double mutant genotypes. To this end, the Beta-Binomial distribution is fitted with free mean and overdispersion parameters at each site across all cells with non-zero variant read counts, and the corresponding likelihood is denoted $L_1$. Next, another constrained Beta-Binomial distribution is fitted using the same set of cells with fixed mean being 0.25 and free overdispersion, whose likelihood is denoted $L_0$. As a result, the test statistic $-2\log \frac$ asymptotically follows the $\chi ^2$ distribution with degrees of freedom being 1. The null hypothesis ($H_0$) is thus that the mean $= 0.25$, and the alternative hypothesis ($H_1$) is that the mean $\ne 0.25$. A site is classified as candidate variant when the corresponding p-value is larger than 0.05 or the fitted mean is larger than 0.25. This analysis is performed on tumour cells. Normal cells are additionally used to filter out germline mutations. This candidate site identification procedure is implemented in a tool named DataFilter.

The sites identified by DataFilter are referred to as ‘candidate’ since they could sometimes be false discoveries due to technical errors in scDNA-seq. Moreover, the actual variant calling, i.e. determination of whether the variant occurs in each of the candidate sites in each cell is performed by SIEVE, and not DataFilter. Notably, all other methods that identify evolutionary trees, including CellPhy [26], SiFit [22], or SCIPhI [15], require an input either actual variants in each cell (CellPhy, SiFit) or the candidate variant sites (SCIPhI). The identification of these candidate sites is crucial for model performance, as it limits the number of sites where the variation may occur, which is much smaller compared to the full set of all possible sites.

Statistical phylogenetic model

The statistical phylogenetic model behind SIEVE includes an instantaneous transition rate matrix, which is defined by a continuous-time homogeneous Markov chain. We consider four possible genotypes $G = \^ \}$, where 0, 1, and $^$ are used to denote the reference nucleotide, an alternative nucleotide, and a second alternative nucleotide which is different from that denoted by 1, respectively. The fundamental evolutionary events we consider are single mutations and single back mutations. The former happen when 0 mutates to 1, or 1 and $^$ mutate to each other, while the latter occur when 1 or $^$ mutates to 0. Hence, genotypes 0/0 and 0/1 represent wildtype and single mutant genotypes, respectively, whereas genotype 1/1 and $1/^$ represent double mutant genotypes. We intentionally use the non-standard nomenclature of single and double mutants to discern important evolutionary events. In contrast, calling both 0/1 and $1/^$ a heterozygous mutation genotype would be more standard and correct, but would not differentiate between the genotype that has only a single allele changed with respect to the reference (0/1) from the genotype that has two alleles changed ($1/^$). We only consider unphased genotypes, so we do not differentiate between 0/1 and 1/0 or between $1/^$ and $^\!/1$.

The joint conditional probability of all cells at SNV site i having genotype $g_ \in G, j = 1, \dots , J$ is determined according to the statistical phylogenetic model by

$$\begin P \left( \left. \boldsymbol^)}_} \,\right| \, \mathcal , \boldsymbol, Q, h, \eta \right) = \sum \limits _^)}_} \setminus g_} P \left( \left. \boldsymbol^)}_}, \boldsymbol^)}_} \setminus g_ \,\right| \, \mathcal , \boldsymbol, Q, h, \eta \right) . \end$$

(1)

In Eq. (1), $\boldsymbol$ represents the branch lengths measured by the expected number of somatic mutations per site and Q is the instantaneous transition rate matrix of the Markov chain. $\mathcal $ is the rooted binary tree topology, representing the genealogical relations among cells. We specifically require the root of $\mathcal $ to have only one child, representing the most recent common ancestor (MRCA) of all cells. The branch between the root and the MRCA is the trunk of the cell phylogeny. The trunk is one of novelties of our approach, introduced to represent the accumulation of clonal mutations (shared among all cells) in the initial phase of tumour progression. Therefore, with J existing cells, labelled by $\$, as leaves, $\mathcal $ has J internal hidden ancestor nodes, labelled by $\$, and $2J - 1$ branches, whose lengths are kept in $\boldsymbol$. The trunk is essential for $\mathcal $ to assure that the root, labelled by 2J, represents a normal ancestor cell even if the data only contains tumour cells. Hence the genotype of the root for SNV site i, denoted $g_$, is fixed to 0/0. $\boldsymbol^)}_}$ represents the genotypes of J cells as leaves of $\mathcal $, while $\boldsymbol^)}_}$ is the genotypes of all ancestor cells as internal nodes of $\mathcal $. Note that we marginalise the genotypes of the ancestor nodes except for the root. We also consider among-site substitution rate variation following a discrete Gamma distribution with mean equal 1, parameterised by the number of rate categories h and shape $\eta$ [45]. $\mathcal , \boldsymbol, \eta$ in Eq. (1) are hidden variables, estimated using MCMC (see the “Posterior and MCMC” section), whereas h is a hyperparameter that is fixed (4 by default). Note that variant calling effectively corresponds to the determination of the values of the variables $\boldsymbol^)}_}$.

In the transition rate matrix Q (Fig. 1c), each entry denotes a rate from one genotype to another during an infinitesimal time interval $\Delta t$. Note that at most one change is allowed to occur in $\Delta t$. For instance, the transition of 0/0 moving to 1/1 during $\Delta t$ is impossible as two single somatic mutations are required; thus, the corresponding transition rate is 0. The transition rate from genotype 0/0 to 0/1 represents the somatic mutation rate and is set to 1. The back mutation rate is measured relatively to the somatic mutation rate and therefore is $^/_$.

With the genotype state space G defined, for a given branch length $\beta$, the underlying four-by-four transition probability matrix $R(\beta )$ of the Markov chain is represented using matrix exponentiation of the product of Q and $\beta$ as $R(\beta ) = \exp (Q \beta )$ [27].

Model of raw read counts

The probability of observing the input data $\mathcal _$ for cell j at site i is factorised as

$$\begin P(\mathcal _) = P(\boldsymbol_} \,|\, c_) P(c_), \end$$

(2)

where the first component is the model of nucleotide read counts and the second the model of sequencing coverage.

Model of sequencing coverage

After single-cell whole-genome amplification (sc-WGA) some genomic regions are more represented than others. After scDNA-seq, this results in an uneven coverage along the genome, much more than in the case of bulk sequencing. Here, to model the sequencing coverage c in the presence of overdispersion, we employ a negative binomial distribution.

$$\begin P(c \,|\, p, r) = \left( \begin c + r - 1\\ r - 1 \end\right) p^r (1 - p)^c, \end$$

(3)

with parameters p and r. We reparameterise the distribution with $p = ^/_$ and $r = ^/_$, where $\mu$ and $\sigma ^2$ are the mean and the variance of the distribution of the sequencing coverage c, respectively.

Theoretically, each cell j at site i has its specific $\mu _$ and $\sigma _^2$ parameters, which, however, are impossible to be estimated freely. Hence, we make additional assumptions and pool the data for better estimates, adapting the approach of [46]. We assume that $\mu _$ and $\sigma ^2_$ have the following forms, respectively:

$$\begin \mu _&= \alpha _ t s_j, \nonumber \\ \sigma ^2_&= \mu _ + \alpha _^2 v s_j^2. \end$$

(4)

In Eq. (4), t is the mean of allelic coverage (the expected coverage per allele) and v is the variance of allelic coverage. We estimate t and v with MCMC (see the “Posterior and MCMC” section). $\alpha _ \in \$ is a hidden random variable denoting the number of sequenced alleles for cell j at site i. According to the statistical phylogenetic model, both alleles are expected to be sequenced. However, due to the frequent occurrence of allelic dropout (ADO) during scWGA, there are cases where only one allele is amplified and therefore $\alpha _$ is 1. Equation (4) reflects the fact that the expected sequencing coverage and its raw variance are proportional to the number of sequenced alleles. Note that inferring the hidden variable $\alpha _$ corresponds to identifying occurrences of ADO events, and hence the ability of SIEVE to perform ADO calling. We denote the prior distribution of $\alpha _$

$$\begin \left\ P(\alpha _ = 1 \,|\, \theta ) = \theta , \text \\ P(\alpha _ = 2 \,|\, \theta ) = 1 - \theta , \text , \end\right. \end$$

(5)

where $\theta$ is a parameter corresponding to the the probability of ADO occurs, i.e. the ADO rate, which is estimated using MCMC.

In Eq. (4), $s_j$ is the size factor of cell j which makes sequencing coverage from different cells comparable and is estimated directly from the sequencing coverage using

$$\begin \hat_j = \underset \ne 0}} \frac} j^\prime = 1 \\ c_ \ne 0 \end}^ c_ \right) ^}}, \end$$

(6)

where $J^\prime$ is the number of cells with non-zero coverage at a site. By taking into account only the non-zero values, the estimate $\hat_j$ is not affected by the missing data, which is prevalent in scDNA-seq.

Model of nucleotide read counts

We denote the genotype affected by ADO $g_^\prime \in G \, \bigcup \, \, 1/\text \}$, where $0/\text $ and $1/\text $ are the results of ADO occurring to $g_$. For instance, $0/\text $ is caused either by 0 dropped out from 0/0 or by 1 dropped out from 0/1. Then the probability of $g_^\prime$ is denoted by

$$\begin P \left( \left. g_^\prime \,\right| \, g_, \alpha _ \right) , \end$$

(7)

which is defined at length in Table 2.

Table 2 Definition of the distribution of $g_^\prime$ conditional on $g_$ and $\alpha _$

We model the read counts of three alternative nucleotides $\boldsymbol_}$ given the sequencing coverage $c_$ with a Dirichlet-multinomial distribution as

$$\begin P(\boldsymbol_} \,|\, c_, \boldsymbol_}) = \frac, a_)} > 0}^3 F(m_, a_) F(c_ - \sum \nolimits _^3 m_, a_)}, \end$$

(8)

with parameters $\boldsymbol_} = \ \,|\, k = 1, \dots , 4 \}$ and $a_ = \sum \nolimits _^4 a_$. F is a function in the form of

$$\begin F(x,y) = \left\ x B(y, x), \text x > 0 \text \\ 1, \text \end\right. \end$$

(9)

where B is the beta function. Note that $c_ - \sum _^3 m_$ is the read count of the reference nucleotide.

To improve the interpretation of Eq. (8), we reparameterise it with $\boldsymbol_} = w_ \boldsymbol_}$, where $\boldsymbol_} = \ \,|\, k = 1, \dots , 4 \}, \sum _^4 f_ = 1$ is a vector of expected frequencies of each nucleotide and $w_$ represents overdispersion. $\boldsymbol_}$ are categorical hidden variables dependent on $g_^\prime$:

$$\begin \boldsymbol_} = \left\ \boldsymbol_} = \left( \frac f, \frac f, \frac f, 1 - f \right) , \text g_^\prime = 0/0 \text 0/\text , \\ \boldsymbol_} = \left( \frac - \frac f, \frac f, \frac f, \frac - \frac f \right) , \text g_^\prime = 0/1, \\ \boldsymbol_} = \left( 1 - f, \frac f, \frac f, \frac f \right) , \text g_^\prime = 1/1 \text 1/\text , \\ \boldsymbol_} = \left( \frac - \frac f, \frac - \frac f, \frac f, \frac f \right) , \text g_^\prime = 1/1^, \end\right. \end$$

(10)

where f is the expected frequency of nucleotides whose existence is solely due to technical errors during sequencing. To be specific, f is defined as the effective sequencing error rate including amplification (where a nucleotide is wrongly amplified into another one during scWGA) and sequencing errors.

$w_$ is also a categorical hidden variable dependent on $g_^\prime$:

$$\begin w_ = \left\ w_1, \text g_^\prime = 0/0, 0/\text , 1/1, \text 1/\text , \\ w_2, \text g_^\prime = 0/1 \text 1/1^, \end\right. \end$$

(11)

where $w_1$ is wildtype overdispersion and $w_2$ is alternative overdispersion.

By plugging in Eqs. (10) and (11), (8) is equivalently represented with

$$\begin P(\boldsymbol_} | c_, g_^\prime , f, w_) = \left\ P_ = P \left( \left. \boldsymbol_} \,\right| \, c_, g_^\prime = 0/0, \boldsymbol_}, w_1 \right) , \\ P_} = P \left( \left. \boldsymbol_} \,\right| \, c_, g_^\prime = 0/\text , \boldsymbol_}, w_1 \right) , \\ P_ = P \left( \left. \boldsymbol_} \,\right| \, c_, g_^\prime = 0/1, \boldsymbol_}, w_2 \right) , \\ P_ = P \left( \left. \boldsymbol_} \,\right| \, c_, g_^\prime = 1/1, \boldsymbol_}, w_1 \right) , \\ P_} = P \left( \left. \boldsymbol_} \,\right| \, c_, g_^\prime = 1/\text , \boldsymbol_}, w_1 \right) , \\ P_^} = P \left( \left. \boldsymbol_} \,\right| \, c_, g_^\prime = 1/1^, \boldsymbol_}, w_2 \right) . \end\right. \end$$

(12)

Note that $P_$ and $P_}$ share the same $\boldsymbol$ and $w_1$, showing that the model of nucleotide read counts is not enough to discriminate 0/0 from $0/\text $, and so do $P_$ and $P_}$. In such cases, incorporating the model of sequencing coverage helps resolve the entanglement.

To understand Eq. (12), first take $P_$ as an example. Theoretically, no alternative nucleotides are supposed to exist if no technical errors occur. Thus, any observations of any alternative nucleotides can only result from technical errors, and the expected frequency of the reference nucleotide is accordingly adjusted to $1 - f$. For another example $P_$, say the reference nucleotide is A and the alternative nucleotide is C, and both their read count frequencies are supposed to be $^/_$ if no technical errors occur. For the other two alternative nucleotides, G and T, their observations could only result from technical errors, and both their frequencies are $^/_$. Moreover, either A or C may be sequenced as a different nucleotide (each with probability 1/2). In the former case, the frequency of A decreases by $^/_$. In the latter case, if C is sequenced as A (with probability $^/_$) the frequency of A increases by $^/_ \times ^/_$. Overall, the frequency of A decreases by $^/_$, resulting in $^/_ - ^/_$.

f, $w_1$ and $w_2$ in Eq. (12) are estimated with MCMC.

SIEVE likelihood

We denote the conditional variables in Eq. (1) as $\Theta = \, \boldsymbol, Q, h, \eta \}$ and those in the model of raw read counts as $\Phi = \$. Given the input data $\mathcal ^$ and $\mathcal ^$, the log-likelihood of the SIEVE model is

$$\begin \log \mathcal (\Theta , \Phi ) = \log \mathcal ^(\Theta , \Phi ) + \log \mathcal ^(f, w_1), \end$$

(13)

where $\mathcal ^$ is the tree likelihood corrected for acquisition bias computed from candidate SNV sites in $\mathcal ^$, while $\mathcal ^$ is the likelihood computed from background sites in $\mathcal ^$, referred to as the background likelihood. Equation (13) does not contain $g_, g_^\prime , \alpha _$ since they are marginalised out (see below).

Since we only use data from SNV sites to compute the tree likelihood, the tree branch lengths $\boldsymbol$ are prone to be overestimated [29, 30]. The overestimation of $\boldsymbol$ due to only using data from SNV sites is called acquisition bias, which is corrected in SIEVE according to [47]:

$$\begin \log \mathcal ^ = \log P \left( \left. \mathcal ^ \,\right| \, \Theta , \Phi \right) + I^\prime \log \left( \frac \sum\limits_^I C_i \right) , \end$$

(14)

where the first component is the uncorrected tree log-likelihood for SNV sites, and $C_i$ in the second component is the likelihood of SNV site i being invariant (see below). The regularisation term $I^\prime \log \big ( \frac \sum _^I C_i \big )$ renders SIEVE in favour of trees with short branch lengths where $\mathcal ^$ is large due to the increasing averaged C.

To compute the uncorrected tree log-likelihood, we marginalise out $\alpha _$ and $g_^\prime$:

$$\begin P(\boldsymbol_}, c_ | g_, \Phi )= & P(\boldsymbol_}, c_ | g_, f, w_, t, v, \theta ) \nonumber \\= & \sum\limits_, g_^\prime } P \left( \left. \boldsymbol_}, c_, \alpha _, g_^\prime \,\right| \, g_, f, w_, t, v, \theta \right) \nonumber \\= & \sum\limits_, g_^\prime } P \left( \left. \boldsymbol_} \,\right| \, c_, g_^\prime , f, w_ \right) P \left( \left. g_^\prime \,\right| \, g_, \alpha _ \right) \nonumber \\&\qquad \quad \times P(c_ \,|\, \alpha _, t, v) P(\alpha _ \,|\, \theta )\nonumber \\= & \left\ \begin P_ & \cdot P(c_ \,|\, \alpha _ = 2, t, v) \cdot (1 - \theta ) \\ & + P_} \cdot P(c_ \,|\, \alpha _ = 1, t, v) \cdot \theta , \text g_ = \text , \end\\ \begin P_ & \cdot P(c_ \,|\, \alpha _ = 2, t, v) \cdot (1 - \theta ) \\ & + \frac (P_} + P_}) \cdot P(c_ | \alpha _ = 1, t, v) \cdot \theta , \text g_ = \text , \end\\ \begin P_ & \cdot P(c_ \,|\, \alpha _ = 2, t, v) \cdot (1 - \theta ) \\ & + P_} \cdot P(c_ \,|\, \alpha _ = 1, t, v) \cdot \theta , \text g_ = \text , \end\\ \begin P_^} & \cdot P(c_ \,|\, \alpha _ = 2, t, v) \cdot (1 - \theta ) \\ & + P_} \cdot P(c_ \,|\, \alpha _ = 1, t, v) \cdot \theta , \text g_ = 1/^, \end \end\right. \end$$

(15)

where $P_, P_}, P_, P_, P_}, P_^}$ are defined in Eq. (12) and $P \left( \left. g_^\prime \,\right| \, g_, \alpha _ \right)$ is defined in Eq. (7). In the second line of Eq. (15), the probability is factorised out according to Fig. 1b.

To compute $\log P \left( \left. \mathcal ^ \,\right| \, \Theta , \Phi \right)$ in Eq. (14), we assume that the SNV sites evolve independently and identically. By plugging Eqs. (1) and (15), $\log P \left( \left. \mathcal ^ \,\right| \, \Theta , \Phi \right)$ is denoted by

$$\begin \log P \left( \left. \mathcal ^ \,\right| \, \Theta , \Phi \right)= & \sum\limits_^I \log \sum\limits_^)}_}} P \left( \left. \mathcal _i^ \,\right| \, \boldsymbol^)}_}, \Phi \right) \sum\limits_^)}_} \setminus g_} P \left( \left. \boldsymbol^)}_}, \boldsymbol^)}_} \setminus g_ \,\right| \, \Theta \right) \nonumber \\= & \sum\limits_^I \log \sum\limits_^)}_}} \Bigg [\prod\limits_^J P(\boldsymbol_}, c_ \,|\, g_, \Phi ) \nonumber \\&\qquad \qquad \qquad \quad \times \sum\limits_^)}_} \setminus g_} P \left( \left. \boldsymbol^)}_}, \boldsymbol^)}_} \setminus g_ \,\right| \, \Theta \right) \Bigg ]\nonumber \\= & \sum\limits_^I \sum _^J \log \sum\limits_^)}_}, \boldsymbol^)}_} \setminus g_} \bigg [ P(\boldsymbol_}, c_ \,|\, g_, \Phi )\nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \quad \times P \left( \left. \boldsymbol^)}_}, \boldsymbol^)}_} \setminus g_ \,\right| \, \Theta \right) \bigg ], \end$$

(16)

which is efficiently computed out by Felsenstein’s pruning algorithm [48], with the extension of the model of raw read counts applied on leaves. Specifically, the Fenselstein’s pruning algorithm is applied to an extended tree $\mathcal $, where additional leaf nodes corresponding to the data are attached at the bottom of $\mathcal $: for each node corresponding to genotype $g_$ there is a leaf node added, corresponding to data $(\boldsymbol_}, c_)$, and the transition probability between the genotype node and the leaf is given by Eq. (15). For I candidate SNV sites, J cells and K genotype states in G (for SIEVE $K = 4$), the time complexity of Felsenstein’s pruning algorithm is $\mathcal (I J K^2)$.

$C_i$ in Eq. (14) is determined similarly to Eq. (16) by computing the joint probability of observing the data $\mathcal _i^$ and $\boldsymbol^)}_} = 0/0$:

$$\begin C_i= & P \left( \left. \mathcal _i^, \boldsymbol^)}_} = 0/0 \,\right| \, \Theta , \Phi \right) \nonumber \\= & P \left( \left. \mathcal _i^ \,\right| \, \boldsymbol^)}_} = 0/0, \Phi \right) \sum\limits_^)}_} \setminus g_} P \left( \left. \boldsymbol^)}_} = 0/0, \boldsymbol^)}_} \setminus g_ \,\right| \, \Theta \right) \nonumber \\= & \prod\limits_^J P \left( \left. \boldsymbol_}, c_ \,\right| \, g_ = 0/0, \Phi \right) \sum\limits_^)}_} \setminus g_} P \left( \left. \boldsymbol^)}_} = 0/0, \boldsymbol^)}_} \setminus g_ \,\right| \, \Theta \right) . \end$$

(17)

Formally, to compute the background likelihood, we should account for the fact that the background sites, similarly to the variant sites, also evolve under the phylogenetic model and involve similar computations as above. This, however, would result in a large additional computational burden due to the large number of background sites compared to the variant sites. Thus, to estimate the background log-likelihood efficiently, we make several simplifications and compute it only approximately. First, we assume that across $I^\prime$ background sites each cell has the same genotype 0/0 and both alleles are covered. We further ignore the model of sequencing coverage and the tree log-likelihood in the computations. As a result, by employing an alternative expression of Dirichlet-multinomial distribution $\log \mathcal ^$ is efficiently obtained as

$$\begin \log \mathcal ^(f, w_1)= & \sum\limits_^ \sum\limits_^J \log P_ \nonumber \\= & \sum\limits_^ \sum\limits_^J \log \Bigg [ \frac + 1)} + w_1)} \prod\limits_^3 \frac + \frac f w_1 \big ) } f w_1 \big ) \Gamma (m_ + 1)}\nonumber \\ &\qquad \qquad \qquad \times \frac - \sum\nolimits_^3 m_ + (1 - f) w_1 \big )} - \sum\nolimits_^3 m_ + 1 \big ) } \Bigg ]\nonumber \\= & I^\prime J \left[ \log \Gamma (w_1) - 3 \log \Gamma \left( \frac f w_1 \right) - \log \Gamma (( 1 - f) w_1) \right] \nonumber \\&+ \sum\limits_^)} N_c ( \log \Gamma (c + 1) - \log \Gamma (c + w_1)) \nonumber \\&+ \sum _^3 \sum\limits_^)} N_ \left( \log \Gamma \left( m_k + \frac f w_1 \right) - \log \Gamma (m_k + 1) \right) \nonumber \\ &+ \sum\limits_^3 m_k = 1}^ - \sum\nolimits_^3 m_)} N_^3 m_k} \Bigg ( \log \Gamma \left( c - \sum\limits_^3 m_k + (1 - f) w_1 \right) \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad - \log \Gamma \left( c - \sum\limits_^3 m_k + 1 \right) \Bigg ), \end$$

(18)

where $P_$ is defined in Eq. (12). $N_c$, $N_$ for $k = 1, 2, 3$ and $N_^3 m_k}$ represent, across $I^\prime$ background sites and J cells, the unique occurrences of sequencing coverage c, of alternative nucleotide read counts $m_1, m_2, m_3$, and of reference nucleotide read counts $c - \sum _^3 m_k$, respectively. In Eq. (18), some items, namely $\log \Gamma (c + 1)$, $- \log \Gamma (m_k + 1)$ for $k = 1, 2, 3$, and $- \log \Gamma \big ( c - \sum _^3 m_k + 1 \big )$, only depends on the data, which remain constants during MCMC. Therefore, they are ignored in the computation of background likelihood. It is clear that the background likelihood helps estimate f and $w_1$.

The time complexity of Eq. (18) is $\mathcal (c)$ with c being the number of unique values of sequencing coverage across all cells and background sites. Since $I J K^2$ is usually much larger than c, the overall time complexity of model likelihood is $\mathcal (I J K^2)$.

Priors

To define priors for model parameters and for the tree coalescent, we employ the prior distributions defined in BEAST 2. We impose on $\mathcal $ and $\boldsymbol$ in Eq. (1) a prior distribution following the Kingman coalescent process with an exponentially growing population. The tree prior is parameterised by scaled population size M and exponential growth rate q, and is denoted by

$$\begin P(\mathcal , \boldsymbol \,|\, M, e), \end$$

(19)

whose analytical form is defined in [49]. M and e are hidden random variables and are estimated using MCMC. Note that, by default, M represents the number of time units, e.g. the number of years, and the mutation rate is measured by the number of mutations per time unit per site. Their product results in the unit of branch length, i.e. the number of mutations per site. Since scDNA-seq data usually does not contain temporal information as a result of collecting samples at the same time, it is impossible to differentiate M from the mutation rate. However, if the mutation rate is known, one could alternatively estimate a time-calibrated cell phylogeny.

As prior distributions, we assign to M

$$\begin P(M \,|\, \delta ) = \frac, \end$$

(20)

where $\delta$ is the current proposed value of M. Note that this is supposed to be normalised to define a proper probability distribution, but this form is sufficient to define a proper posterior (see the “Posterior and MCMC” section).

For e, we choose

$$\begin e \,|\, \lambda , \epsilon \,\sim \, \text (\lambda , \epsilon ), \end$$

(21)

where we choose mean $\lambda = 10^$ and scale $\epsilon = 30.7$ (default in the BEAST 2 software). We choose an exponential distribution as the prior for $\eta$ in Eq. (1):

$$\begin \eta \,|\, \gamma \,\sim \, \exp (\gamma ), \end$$

(22)

where $\gamma = 1$.

For the model of sequencing coverage described in Eqs. (3) and (4), we set the prior for t within a large range of values with

$$\begin t \,|\, \rho \,\sim \, \text (0, \rho ), \end$$

(23)

where $\rho = 1000$, and the prior for v with

$$\begin v \,|\, \zeta \,\sim \, \exp (\zeta ), \end$$

(24)

where $\zeta = 25$. In terms of $\theta$ in Eq. (5), it also has a uniform prior:

$$\begin \theta \,|\, u \,\sim \, \text (0, u), \end$$

(25)

where $u = 1$.

For the model of nucleotide read counts described in Eqs. (10) to (12), we choose an exponential prior for f:

$$\begin f \,|\, \tau \,\sim \, \exp (\tau ), \end$$

(26)

where $\tau = 0.025$, and a log normal prior for both $w_1$ and $w_2$:

$$\begin w_1 \,|\, \xi _1, \psi _1&\,\sim \, \text (\xi _1, \psi _1), \nonumber \\ w_2 \,|\, \xi _2, \psi _2&\,\sim \, \text (\xi _2, \psi _2), \end$$

(27)

where we choose for $w_1$ the log-transformed mean $\xi _1 = 3.9$ (150 for untransformed) and the standard deviation $\psi _1 = 1.5$, and for $w_2$ the log-transformed mean $\xi _2 = 0.9$ (10 for untransformed) and the standard deviation $\psi _2 = 1.7$. Specifically, the mean is log-transformed using

$$\begin \xi _} = \log (\xi _}) - \frac. \end$$

These specific values reflect our belief that $w_1$ is greater than $w_2$, and are chosen in such a way that both distributions cover a large range of possible values for $w_1$ and $w_2$.

Posterior and MCMC

With the model likelihood and priors defined, the posterior distribution of the unknown parameters is

$$\begin P \left( \left. \mathcal , \boldsymbol, M, e, \eta , t, v, \theta , f, w_1, w_2 \,\right| \, \mathcal ^, \mathcal ^ \right)= & \frac P \left( \left. \mathcal ^, \mathcal ^ \,\right| \, \mathcal , \boldsymbol, \eta , t, v, \theta , f, w_1, w_2 \right) \nonumber \\&\times P(\mathcal , \boldsymbol \,|\, M, e) P(M \,|\, \delta ) P(e \,|\, \lambda , \epsilon ) P(\eta \,|\, \gamma ) \nonumber \\&\times P(t \,|\, \rho ) P(v \,|\, \zeta ) P(\theta \,|\, u) P(f \,|\, \tau ) \nonumber \\&\times P(w_1 \,|\, \xi _1, \psi _1) P(w_2 \,|\, \xi _2, \psi _2), \end$$

(28)

where Z is a normalisation constant, representing the probability of the observed data.

Since the posterior distribution does not have a closed-form analytical formula, we employ the MCMC algorithm with Metropolis-Hastings kernel to sample from the posterior distribution in Eq. (28). Given the current state of the parameters q, we propose a new state $q^*$ according to proposal distributions $P(q^*| q)$ that assure the reversibility and ergodicity of the Markov chain. With one parameter changed a time, $q^*$ is accepted with probability

$$\begin \min \left\^*, \boldsymbol^*, M^*, e^*, \eta ^*, t^*, v^*, \theta ^*, f^*, w_1^*, w_2^*\,\right| \, \mathcal ^, \mathcal ^ \right) P(q \,|\, q^*)}, \boldsymbol, M, e, \eta , t, v, \theta , f, w_1, w_2 \,\right| \, \mathcal ^, \mathcal ^ \right) P(q^*\,|\, q)} \right\} , \end$$

(29)

where the normalisation constant Z cancels out after plugging in Eq. (28).

For sampling the structure of the cell phylogeny, we take advantage of proposal distributions implemented in the BEAST 2 software [49] and modify them to make sure they are compatible with our tree topology, so that the sampled trees are binary and contain a trunk. Specifically, the tree branch lengths are changed by scaling the heights of the internal nodes. For tree topological exploration, we use the Wilson-Balding move to perform subtree pruning and regrafting. Specifically, a random node and half of its subtree is pruned and reattached to a random branch not belonging to the moved subtree. A subtree-slide move is also used, where a random node and half of its subtree slides either upwards or downwards along branches and cross at least one node. Both those two moves include changes to the lengths of some branches. The final type of move swaps two randomly selected subtrees.

For sampling unknown parameters, we perform either scaling operations or random Gaussian walks.

SIEVE runs with a two-stage sampling strategy. In the first stage the acquisition bias correction is switched off and all parameters are explored, while in the second stage the acquisition bias correction is turned on and parameters not affecting branch lengths are fixed with their estimates from the previous stage. This two-stage strategy proved to yield more accurate parameter and tree estimates than a strategy where both parameters and tree would be explored at once, with the acquisition bias correction enabled. Additionally, the initial tree in the second stage is set to the tree summarised from the first stage.

Variant calling, ADO calling, maximum likelihood gene annotation and mutation event classification

During the sampling process $\boldsymbol^)}_}$, $\boldsymbol^)}_}$, $g_^\prime$ and $\alpha _$ (Eqs. (1), (15) and (16)) are hidden variables that are marginalised out. Therefore, to obtain estimates of these hidden variables, we infer their maximum likelihood configuration with the max-sum algorithm [50], using the maximum clade credibility tree [51] and parameters estimated from the MCMC posterior samples.

To be specific, by determining the maximum likelihood genotypes of the leaves ($\boldsymbol^)}_}$), we are able to call variants. By inferring the maximum likelihood $g_^\prime$ and $\alpha _$, the ADO state is determined. Moreover, by computing the maximum likelihood genotypes of the internal nodes ($\boldsymbol^)}_}$), SIEVE maps mutations to specific tree branches.

Mutation events are classified into different categories based on the corresponding genotype transitions (see Table 1). The single mutation ($0/0 \rightarrow 0/1$) happens when an allele of the wildtype is mutated. The homozygous coincident double mutation ($0/0 \rightarrow 1/1$) refers to the case when both alleles of the wildtype are mutated to the same alternative nucleotide, while the heterozygous coincident double mutation ($0/0 \rightarrow 1/^$) refers to the case when both alleles of the wildtype are mutated to different alternative nucleotides. The single back mutation ($0/1 \rightarrow 0/0$, $1/1 \rightarrow 0/1$ and $1/1 \rightarrow 0/1$) happens when a mutated allele mutates back to the reference nucleotide, while the double back mutation ($1/1 \rightarrow 0/0$ and $1/^ \rightarrow 0/0$) happens when both mutated alleles mutate back to the reference nucleotide. The homozygous single mutation addition ($0/1 \rightarrow 1/1$) refers to the case when the unmutated allele of the single mutant genotype mutates to the same alternative nucleotide as the mutated allele, while for the heterozygous single mutation addition ($0/1 \rightarrow 1/^$) the unmutated allele mutates to an alternative nucleotide different from the mutated allele. For the homozygous substitute single mutation ($1/^ \rightarrow 1/1$), one of the mutated alleles mutates to the same alternative nucleotide as the other mutated allele, while for the heterozygous substitute single mutation (\(1/1 \

View original article

GENOME BIOLOGY

分享书签

0 0 0 0 0 0 0

More from this channel

SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

留言 (0)