SPUMONI 2: improved classification using a pangenome index of minimizer digests

Compressed indexes

Given a text T[1..n] of length n, the Burrows-Wheeler transform (BWT) [23] is a reversible permutation of T’s characters such that \(\mathrm{BWT}[i]\) is the character preceding the \(i^{\mathrm{th}}\) lexicographically smallest suffix of T. The BWT of a repetitive text will have long runs of the same character. The symbol r denotes the number of maximal same-character runs in the BWT.
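As a concrete (if naive) illustration of these definitions, the sketch below builds the BWT by sorting cyclic rotations and counts r. The `$` sentinel and the quadratic-time construction are simplifications for exposition only, not the construction method SPUMONI actually uses.

```python
def bwt(text):
    """Naive BWT via sorted cyclic rotations; '$' is a sentinel
    assumed to sort before every other character."""
    t = text + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

def run_count(s):
    """r: the number of maximal same-character runs in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

# A highly repetitive text yields a BWT with few runs (r much less than n).
b = bwt("GATTACA" * 8)
print(run_count(b), len(b))
```

Run-length encoding this BWT is what lets the r-index occupy O(r) rather than O(n) space.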

The r-index [24] is a compressed index consisting of the run-length encoded BWT, plus auxiliary structures enabling rapid queries for counting the number of times a substring occurs in T, as well as for locating all offsets in T where a substring occurs. Only O(r) space is needed overall. More details can be found in the study of Gagie et al [11].

Matching statistics and MONI

The matching statistics array \(\textsf{MS}[1..m]\) is defined for a pattern P[1..m] with respect to a text T[1..n]. The element \(\textsf{MS}[i]\) equals the length of the longest prefix of P’s \(i^{\mathrm{th}}\) suffix that occurs in T. Using the r-index, Bannai et al. [25] proposed a two-pass algorithm for computing \(\textsf{MS}\) using an additional “thresholds” structure. The algorithm’s first pass iterates over P from right to left, attempting to use the LF mapping at each step to extend the match to the left by one character. If the algorithm reaches a row j and finds that the match cannot be extended because the next character of P mismatches \(\textsf{BWT}[j]\), the algorithm skips (“jumps”) either up or down to the next BWT run that does start with the next character of P. The direction of the jump is determined by the threshold, which indicates whether jumping up or down yields a longer match. Given the steps taken through the BWT in the first pass of the algorithm, the second pass uses a random-access data structure over T to compute the exact matching statistic lengths for \(\textsf{MS}\). More details can be found in the study of Rossi et al. [13].
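For intuition about what the two-pass algorithm computes, matching statistics can also be written brute-force, directly from the definition (far slower than the r-index algorithm, but useful as a reference):

```python
def matching_statistics(P, T):
    """Naive matching statistics: ms[i] is the length of the longest
    prefix of P[i:] that occurs as a substring of T."""
    ms = []
    for i in range(len(P)):
        length = len(P) - i
        # shrink the candidate prefix until it occurs in T (or is empty)
        while length > 0 and P[i:i + length] not in T:
            length -= 1
        ms.append(length)
    return ms

print(matching_statistics("GATTA", "CATTAG"))  # [1, 4, 3, 2, 1]
```

For example, MS[2] = 4 because "ATTA" occurs in "CATTAG" but "GATTA" does not.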

Pseudomatching lengths and SPUMONI

Pseudomatching lengths (PMLs) were proposed in the SPUMONI study as a quantity similar to matching statistics but more efficient to compute. SPUMONI’s computation of PMLs does not require the second pass of the algorithm described above. Instead, each step of the first pass that successfully extends the match using the LF mapping causes a length variable to be incremented by one. When the algorithm reaches a step where it cannot proceed using the LF mapping and has to jump instead, the length variable is reset to 0. The values taken by this variable at each step constitute the vector of PMLs (\(\textsf{PML}\)). The \(\textsf{PML}\) values are upper-bounded by the \(\textsf{MS}\) values, but we have shown that they are similarly useful for classification of sequences [10].

The SPUMONI index consists chiefly of the r-index (omitting the sampled suffix array) and the thresholds. These structures allow SPUMONI to compute \(\textsf{PML}\) in O(r) space. Because the r-index component does not need to include the sampled suffix array, SPUMONI’s disk and memory footprint can be substantially smaller than that of the MONI algorithm described in the previous subsection.

Minimizer digestion

SPUMONI 2 computes minimizers by applying a hash function to all k-mers in a larger window of size w, \(w>k\). The k-mer with minimal hash value is the minimizer. If the minimizers found in two or more consecutive steps are identical, they are compressed to a single copy of the minimizer. SPUMONI 2 uses the RollingHasher object provided by the Bonsai C++ library [26].

These minimizer sequences are then indexed using the r-index [11] to build a run-length encoded BWT and thresholds data structure.

Read classification

The previous SPUMONI method classified reads by accumulating an empirical distribution of positive PMLs (with respect to the reference), and another of null PMLs (with respect to the reverse of the reference) [10]. It used a Kolmogorov-Smirnov statistical test (KS-test) to assess whether the distribution of positive PMLs was overall larger (shifted higher) compared to the null PMLs.

SPUMONI 2 uses a simpler approach; it also accumulates empirical distributions of positive and null PMLs, but it classifies reads by first computing a threshold PML value as a function of the null PML distribution. The null PML distribution is constructed by extracting small reads (substrings) from the reference, reversing the sequence, then computing PMLs for those sequences with respect to the reference. Since the sequences are reversed (not reverse complemented), they serve as random sequences that should not be classified as matching the reference, but they nonetheless share the reference’s base distribution. Pooling across simulated reads, we compile an aggregate distribution of null PMLs and compute the threshold as the largest PML that occurs at least 5 times. Because the process of computing the threshold uses the same parameters as the read-classification process (i.e., same minimizer scheme and alphabet), the choice of threshold is customized to the problem at hand.
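The threshold rule itself is simple enough to state in a few lines. The sketch below assumes the pooled null PMLs have already been computed; the `min_count=5` default mirrors the rule described above, and the example distribution is invented for illustration.

```python
from collections import Counter

def null_threshold(null_pmls, min_count=5):
    """Threshold = largest PML value observed at least `min_count`
    times in the pooled null distribution."""
    counts = Counter(null_pmls)
    eligible = [pml for pml, c in counts.items() if c >= min_count]
    return max(eligible) if eligible else 0

# Hypothetical pooled null PMLs: most values are small, with a rare outlier.
null_pmls = [1] * 40 + [2] * 25 + [3] * 10 + [4] * 6 + [5] * 3 + [9] * 1
print(null_threshold(null_pmls))  # 4: values 5 and 9 occur fewer than 5 times
```

Requiring a minimum count keeps a single chance long match in the null distribution from inflating the threshold.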

Given the threshold PML value, SPUMONI 2 classifies a read by computing the read’s positive PMLs with respect to the reference index in many distinct non-overlapping windows of the read. By default, the read is divided into non-overlapping windows of 150 symbols. We found this window size performs well in terms of accuracy across different alphabets (minimizers or DNA). A window at the end of the read that is shorter than 150 symbols is grouped together with the previous window. If a majority of the windows have a maximum PML greater than the threshold, the read is classified as matching the reference, otherwise it is classified as not matching.
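The windowed majority vote can be sketched as follows. This is an illustrative reimplementation of the rule as described, not SPUMONI 2’s actual code; it takes the read’s PML array and the precomputed threshold as inputs.

```python
def classify_read(pmls, threshold, window=150):
    """Majority vote over non-overlapping windows of the read's PMLs.
    A trailing window shorter than `window` is merged into the previous
    window. Returns True if the read matches the reference."""
    n = len(pmls)
    bounds = list(range(0, n, window))
    if len(bounds) > 1 and n % window != 0:
        bounds.pop()  # fold the short tail into the final window
    windows = [pmls[b:b + window] for b in bounds[:-1]] + [pmls[bounds[-1]:]]
    hits = sum(1 for w in windows if max(w) > threshold)
    return hits * 2 > len(windows)  # strict majority
```

Using the maximum PML per window means a single sufficiently long match anywhere in a window counts that window as evidence of presence.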

Adaptive sampling simulation

Nanopore sequencers report a time series representing the amount of current flowing through a pore. The data is split into batches which are delivered to the control software for analysis. Following our experimental setup from the SPUMONI study [10], we simulated 4 batches of 0.4 s of basecalled nanopore data, with each batch consisting of 180 bp. Batches of this size were shown to be sufficient for binary classification methods [5, 7]. We start from base-called data, rather than from raw signal data.

For our adaptive sampling simulation, we simulated two different scenarios where adaptive sampling could be used to enrich for certain reads. The first scenario is a mock community which consists of seven bacterial species (Staphylococcus aureus, Salmonella enterica, Escherichia coli, Pseudomonas aeruginosa, Listeria monocytogenes, Enterococcus faecalis, Bacillus subtilis) and one yeast species (Saccharomyces cerevisiae). The goal in this scenario is to enrich for the yeast reads by rejecting bacterial reads. To accomplish this, we built a pangenome reference over all the strains of the seven bacterial species in RefSeq [17] to identify which reads come from bacteria.

The second scenario is focused on the human microbiome where we used a 50:50 mixture of simulated human reads from the CHM13 reference [27], and real nanopore reads from the microbial species in the human gut [28]. The goal in this experiment is to enrich for the microbial reads while removing as many human reads as possible. For this scenario, we built a pangenome reference over 10 human assemblies [27, 29,30,31,32,33,34,35,36] in order to identify human reads that we want to remove.

We use both SPUMONI and minimap2 to classify batches against the pangenome index. A read classified as “present” should be immediately ejected from the pore, so subsequent batches of data are not processed. SPUMONI 2 classifies the current batch of 180 bp using the test described in the “Read classification” section. In the case of minimap2, aligning against a pan-genomic database usually yields numerous alignments: a primary and many secondary alignments. We examine these to ensure all are to the same species and, if so, we classify the read as “present.” For any batch after the first, we provide the entire base-called read so far to minimap2; i.e., we concatenate the bases from all the batches so far. In this way, minimap2 redundantly processes the same bases when examining batches beyond the first. This is consistent with minimap2’s design; unlike the r-index-based algorithms that have the ability to “pause” and “resume” the matching process, minimap2 must be run on a complete read sequence.

The time required for each method is reported in Table 2; this is the total time used by each method to classify the reads across all 4 batches of data.

Sampled document array

SPUMONI 2’s sampled document array consists of 2r integers encoding the class of the suffixes at the beginning and end of every BWT run. The class labels are computed at index construction time. Note that the class labels can be stored in \(\lceil \log _2 c \rceil\) bits, where c is the number of class labels, which will often be much less than the \(\lceil \log _2 n \rceil\) bits required for suffix array entries. An example is shown in Fig. S4 (Additional file 1).
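The space saving is easy to quantify. The numbers below are illustrative (a 10-class pangenome of roughly 3 Gbp total length), not measurements from the paper:

```python
import math

def label_bits(c):
    """Bits per class label: ceil(log2 c)."""
    return math.ceil(math.log2(c))

def sa_bits(n):
    """Bits per suffix-array entry over a length-n text: ceil(log2 n)."""
    return math.ceil(math.log2(n))

# e.g., 10 documents (classes) vs. a ~3 Gbp pangenome
print(label_bits(10), sa_bits(3_000_000_000))  # 4 vs 32
```

So per sampled position, a class label here needs 4 bits where a suffix-array sample would need 32, an 8-fold reduction for that component.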

At query time, the sampled document array is queried whenever the algorithm for computing PMLs passes through a suffix at a run boundary. As described in detail in the MONI paper [13], the algorithm proceeds base-by-base starting from the right-hand extreme of the read. Each step falls into one of two cases. When the match can be extended to the left using the LF mapping, this is called “case 1.” When the match cannot be extended to the left using the LF mapping, the algorithm uses threshold information to choose a new BWT run to “jump” to. This is “case 2.” Importantly, case 2 always results in the algorithm moving to a run boundary. That is, when the algorithm jumps, it jumps either to the beginning of a run (if it jumps down) or the end of a run (if it jumps up).

When using the sampled document array, the SPUMONI 2 algorithm compiles not only its usual array of PMLs, but also an array of document (class) labels, \(\textsf{Doc}\). At position j in the read where the algorithm uses case 2, the label \(\textsf{Doc}[j]\) equals the sampled document array element corresponding to the run boundary jumped to. For a position j in the read where the algorithm uses case 1, the label \(\textsf{Doc}[j]\) equals the class observed in the most recent instance of case 2. When the read truly originates from one of the classes in the reference pangenome, we expect the majority of the labels in \(\textsf{Doc}\) to match the true class. This algorithm is summarized in the form of pseudocode in Fig. S2 (Additional file 1).
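The label-propagation rule can be sketched in isolation. Here the right-to-left walk through the read is abstracted as a precomputed list of steps (each either case 1, or case 2 carrying the label read from the sampled document array); the species names are hypothetical.

```python
def fill_doc_labels(steps):
    """Assign Doc labels along the read. `steps` lists, for each read
    position from rightmost to leftmost, either ('case1', None) or
    ('case2', label), where the label comes from the sampled document
    array entry at the run boundary jumped to."""
    doc = []
    current = None
    for case, label in steps:     # rightmost position first
        if case == 'case2':
            current = label       # fresh label read at a run boundary
        doc.append(current)       # case 1 reuses the most recent label
    doc.reverse()                 # report labels in left-to-right read order
    return doc

steps = [('case2', 'E.coli'), ('case1', None), ('case1', None),
         ('case2', 'S.aureus'), ('case1', None)]
print(fill_doc_labels(steps))
# ['S.aureus', 'S.aureus', 'E.coli', 'E.coli', 'E.coli']
```

A majority vote over the resulting labels then gives the read’s predicted class, mirroring the windowed majority rule used for presence/absence.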
