A survey of mapping algorithms in the long-reads era

Seeding almost always uses sketched, exact, fixed-length seeds

Seeding is the procedure that consists in collecting a set \(\mathcal{S}\) of seeds from the reference, then finding anchors between the query's seeds and \(\mathcal{S}\). In order to find anchors efficiently, \(\mathcal{S}\) is stored using an index data structure. In the following sections we detail the different types of seeds that have been used in long-read mapping.

k-mers

Substrings of length k, or k-mers, are perhaps the most commonly used seed in bioinformatics. A k-mer seed can be indexed by using a hash function to produce an integer value (usually a 32- or 64-bit integer), which is then added to a hash table. This makes indexing of k-mers computationally cheap, provided that the hash function and the hash table implementation are efficient. Methods to hash k-mers efficiently have been proposed [75]; they use the previous k-mer's hash value to compute the next one with a rolling hash function.
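As an illustration, the sketch below shows a generic polynomial rolling hash over a 2-bit DNA encoding, where each new k-mer hash is obtained from the previous one in constant time. This is only a minimal example of the rolling-hash idea, not the specific scheme of [75]; the encoding, radix, and function names are choices made for this illustration.

    ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

    def rolling_kmer_hashes(seq, k, base=4):
        """Yield (start position, hash) for every k-mer of seq in O(|seq|) time."""
        if len(seq) < k:
            return
        top = base ** (k - 1)              # weight of the outgoing (leftmost) base
        h = 0
        for c in seq[:k]:                  # hash the first k-mer directly
            h = h * base + ENC[c]
        yield 0, h
        for i in range(1, len(seq) - k + 1):
            out_b, in_b = seq[i - 1], seq[i + k - 1]
            h = (h - ENC[out_b] * top) * base + ENC[in_b]   # O(1) rolling update
            yield i, h

    # Example: hash every 4-mer of a toy sequence.
    for pos, h in rolling_kmer_hashes("ACGTACGT", 4):
        print(pos, h)

With base = 4, the value is simply the 2-bit integer encoding of the k-mer; real implementations use a larger radix or an additional mixing step to spread values over the hash-table range.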

If a k-mer anchor is found, it is guaranteed to be exact (disregarding hash collisions). While it is desirable to produce anchors only in identical regions to minimize hits to non-homologous regions, a downside is that mutations in homologous regions will also alter the k-mers, preventing anchors in the region. Typically, a single substitution alters the \(k\) k-mers that overlap it. The length distribution of stretches of consecutive non-matching k-mers between two homologous regions with substitutions depends on the substitution rate, and has been studied theoretically in [8].

k-mer sketching

As two consecutive k-mers share most of their sequence, they are largely redundant. Therefore, memory overhead and query time could be reduced without losing much information if only some k-mers were stored. Here we present different methods for picking a subsample of representative k-mers as seeds. These approaches have proven their efficiency by drastically reducing the number of objects to index while keeping high sensitivity and specificity in mapping applications. There exist two broader classes of sketching techniques: methods that offer a distance guarantee between consecutively sampled seeds, and methods that do not. Both classes of methods have been used to estimate similarity between sequences. However, the central research questions of the former class of methods involve the distance distribution between sampled seeds, the fraction of conserved seeds under mutations, and the compression ratio relative to the original input data. In contrast, the central question studied for sketching methods without distance guarantees is often how to produce unbiased estimations of sequence identity [10, 40, 69], which distance-bounded sketching methods cannot guarantee [6].

No guarantee on distance

Methods in this category typically store a predetermined number of k-mers (e.g., MinHash [10], OrderMinHash [69]) from the sequence. The k-mers are selected based on properties of their hash values. For example, in MinHash sketching, a total ordering on the k-mers' hashes is used, and a fixed set of minimal hashes in the ordering is kept. This technique gives no distance guarantee between seeds, meaning a large gap can appear between two consecutive sampled k-mers. MinHash has been used to perform extension-free mapping [46] for genome-length sequences and to find read-to-read overlaps in long-read assembly [7, 90]. However, fixed-size sketches do not adapt well to different read lengths: the number of sampled seeds remains constant regardless of the number of distinct k-mers. Because of this, two similar regions from sequences of different sizes will not automatically have the same selected seeds, which is a desired property for seeding. Therefore this approach was later replaced by other scaled sketch strategies [40, 45]. FracMinHash has been used for long-read mapping [25] (called universal minimizers in their study), and works well when reads are long enough, but it is important to note that, theoretically, no distance guarantee exists for scaled sketch hashing methods, regardless of the density of the sketch.
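The contrast between the two strategies can be sketched as follows: MinHash keeps a fixed number of smallest hashes, whereas a scaled (FracMinHash-like) sketch keeps every k-mer whose hash falls under a fixed fraction of the hash space, so the sketch grows with the input. The hash function and parameter names below are illustrative choices, not the implementations of [10] or [40].

    import hashlib

    def kmer_hash(kmer):
        # Stable 64-bit hash of a k-mer (illustrative choice of hash function).
        return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

    def minhash_sketch(seq, k, size):
        """Fixed-size sketch: keep the `size` smallest distinct k-mer hashes."""
        hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        return sorted(hashes)[:size]

    def scaled_sketch(seq, k, scale):
        """FracMinHash-like sketch: keep every k-mer whose hash is below 1/scale
        of the hash space; the sketch size grows with the sequence length."""
        threshold = (1 << 64) // scale
        return sorted({h for h in (kmer_hash(seq[i:i + k])
                                   for i in range(len(seq) - k + 1))
                       if h < threshold})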

Distance guaranteed

The first distance-bounded k-mer sketching technique proposed for long-read mapping was minimizers. Minimizers were introduced in two independent publications [83, 88], and were popularized in bioinformatics by the tools minimap [58] and minimap2 [59]. In our framework, minimizers are k-mers sampled as determined by three parameters k, w, and h, where h is a function that defines an order, e.g., the lexicographical order. Given the set of w consecutive k-mers in a window starting at position p on the sequence (i.e., the k-mers starting at positions \([p, p+w-1]\)), a minimizer is the k-mer associated with the minimal value of h over this set (see middle panel in Fig. 3). Minimizers are produced by extracting a minimizer from each consecutive window \(p \in [0,|S| - w - k + 1]\) over a sequence S.
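A direct (non-rolling) implementation of this definition might look as follows; the ordering function defaults to lexicographic comparison purely for illustration, and ties are broken by the leftmost position.

    def minimizers(seq, k, w, h=lambda kmer: kmer):
        """Return the sorted set of (position, k-mer) minimizers of seq: one k-mer
        with minimal h-value per window of w consecutive k-mers (leftmost on ties)."""
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        selected = set()
        for p in range(len(kmers) - w + 1):            # window of k-mers p .. p+w-1
            best = min(range(p, p + w), key=lambda j: (h(kmers[j]), j))
            selected.add((best, kmers[best]))          # duplicates collapse in the set
        return sorted(selected)

    # Example with the lexicographic order, k = 3, w = 4:
    print(minimizers("ACGGTAGCAT", 3, 4))

In practice, a random (often invertible) hash is preferred over the lexicographic order (see the "Seed transformations before indexing" section), and a sliding-window minimum queue avoids recomputing the minimum for every window.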

Fig. 3

Illustration of spaced seeds, minimizer selection, syncmer selection and context dependency. Here, two sequences \(s_1\) and \(s_2\) differ by a single mutated base (second base, in pink). When comparing those sequences, one would like to focus on common bases, i.e., bases highlighted in grey. In the left panel, we present spaced seeds for \(k=5\), with a wildcard at the second position (represented by an underscore). We observe that identical spaced seeds can be spelled over a mutated locus, e.g., T_GAGG. In the middle panel, we present a selected minimizer with \(k=5, w=3\). One blue window is presented, a second is suggested in lighter blue. A star shows the position of the selected k-mer in the window (we use the lexicographic order). The mutated base has an impact on the overall window content, therefore a k-mer from the (unmutated) region of interest in \(s_1\) is no longer selected in \(s_2\). On the contrary, in the right panel, we show that syncmers can be more robust in this situation. We choose \(k=5, s=2\) and present closed syncmers. We underline the smallest \(s\)-mer in each k-mer in blue and a star shows the selected k-mers. We see that in this example, the mutated base has no impact on the syncmer selection, and the same syncmer is selected in the region of interest for \(s_1\) and \(s_2\)

Since at least one minimizer is selected in each window, minimizers have a distance guarantee. While the distance guarantee (hence the seed density in regions) is desired for mapping applications, it is also desired to sample as few minimizers as possible to reduce storage. Different optimizations have been proposed to reduce the density of sampled minimizers while keeping the distance guarantee. Weighted minimizers [48] implement a procedure to select k-mers according to their rareness: k-mers are first counted, and frequently occurring ones are down-weighted so that k-mers from highly repetitive regions are less likely to be selected; this weight is then taken into account in the hashing procedure. If low-occurrence k-mers are too far apart in a query, another solution [60] allows sampling minimizers in the repetitive region as well, by keeping the minimizers with the lowest occurrence counts in that region.

In minimizer sketching, the choice of the minimizer in each window depends on the other k-mers in the window. This property is called context dependency [91]. Context dependency is typically not desired in sketching methods, as neighbouring k-mers affected by mutations may alter the minimizers in a window. However, for finding anchors it is desired to guarantee that the same k-mers are sampled between two homologous regions regardless of neighboring mutations. Therefore, context-independent methods have been proposed, such as syncmers [23] and minimally overlapping words (MOW) [32], where the sampling does not depend on the neighboring k-mers. Syncmers were used in the context of long-read mapping [91] in an alternative implementation of minimap2 and, more recently, in [22]. For their construction, syncmers use the s-mers (\(s<k\)) occurring within k-mers; each k-mer contains \(k-s+1\) of them (see right panel in Fig. 3 for an illustrated difference with minimizers, and Additional file 1: Fig. S1). The k-mer is selected if its smallest s-mer meets some criterion. An example criterion is that the smallest s-mer appears at a fixed position p within the k-mer (\(0 \le p < k-s+1\)); these are called open syncmers. A more studied category is closed syncmers, where the smallest s-mer must be at the first or the last s-mer position in the k-mer. This selection uses properties intrinsic to each k-mer and is therefore context-free. Closed syncmers also have a distance guarantee. By construction, syncmers tend to produce a more even spacing between sampled seeds while still allowing a distance guarantee.
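The sketch below illustrates both variants under these definitions; the ordering of s-mers is lexicographic here for readability, and the function names are ours.

    def smallest_smer_pos(kmer, s, h=lambda x: x):
        """Index of the smallest s-mer inside the k-mer (leftmost on ties)."""
        return min(range(len(kmer) - s + 1), key=lambda i: (h(kmer[i:i + s]), i))

    def closed_syncmers(seq, k, s):
        """k-mers whose smallest s-mer is at the first or last possible position."""
        return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)
                if smallest_smer_pos(seq[i:i + k], s) in (0, k - s)]

    def open_syncmers(seq, k, s, p):
        """k-mers whose smallest s-mer is at one fixed offset p (0 <= p < k-s+1)."""
        return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)
                if smallest_smer_pos(seq[i:i + k], s) == p]

    # Closed syncmers on a toy sequence, with the same parameters as the
    # right panel of Fig. 3 (k = 5, s = 2):
    print(closed_syncmers("ACGGTAGCAT", 5, 2))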

Fuzzy seeds

Due to read errors and SNPs between the reference and the sequenced organism, it is in many scenarios desired that a seed anchors the query and the reference in homologous regions even if the seeds extracted from the two regions differ. In other words, we would want similar seeds to hash to identical hash values. A hash function that produces identical hash values for similar but not necessarily identical inputs is usually called a locality-sensitive hash function. We will refer to seeds produced by such methods as fuzzy or inexact seeds. Several methods to produce fuzzy seeds have been described.

Perhaps the most common one is spaced seeds. Within a spaced seed, some positions are required to match (called fixed positions), while the remaining positions can be ignored (called wildcards or don't care positions). Within a k-mer, positions are designated as fixed or wildcard by applying a particular mask on the k-mer's bases [44]. Spaced seeds are effective for data with substitutions and are, for example, used in the popular sequence mapping software BLAST [4], in metagenome short-read classification [13], and in the long-read mapping tool GraphMap [95]. Typically, multiple seed patterns are used [62, 95], where the overlap between the fixed positions of the different patterns should be minimized [43] to increase sensitivity. For example, GraphMap queries three different seeds to the index for each position in the query. This design is capable of handling substitutions and indels of one nucleotide. We provide details on this scheme in Additional file 1: Figs. S2 and S3. However, spaced seeds can only handle indels if multiple patterns are queried per position, and the number of patterns required increases with indel size [35]. Although the computation of good sets of spaced seed patterns has been optimized [44], such seeding can become computationally prohibitive if the application requires matching over indels beyond a couple of nucleotides.
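The core operation is simply projecting a window of bases onto the fixed positions of a mask; the mask and the two example windows below are hypothetical, chosen to mimic the T_GAGG example of Fig. 3 where the wildcard absorbs a substitution.

    def apply_spaced_seed(window, mask):
        """Keep only the fixed positions ('1') of the window; '0' marks a wildcard.
        len(mask) must equal len(window); the result is used as the seed key."""
        return "".join(base for base, m in zip(window, mask) if m == "1")

    # Two windows differing only at the wildcard position yield the same seed:
    mask = "101111"                               # weight 5, one wildcard
    print(apply_spaced_seed("TAGAGG", mask))      # -> TGAGG
    print(apply_spaced_seed("TCGAGG", mask))      # -> TGAGG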

As indels are a frequent source of variability in long reads, spaced seeds have, with the exception of GraphMap, not been frequently used in long-read mapping algorithm designs. There are other types of fuzzy seed constructs, such as permutation-based seeds [56], but they only tolerate substitutions and have been used in short-read mapping.

Traditionally, anchoring over indels has been solved by querying multiple seeds in a region and performing post-clustering of nearby anchoring seeds, which are then inferred as an anchoring region. Such an approach usually provides gold-standard sequence similarity queries [4, 52]. However, it comes at a substantial computational cost, not only because of the post-clustering step but also because relatively short seeds must be used to guarantee sensitivity, which can yield many anchors to other regions.

To remove the overhead of post-processing nearby seeds, one can instead link several k-mers into a single seed and represent it as a single hash value before storing it in the index. Such linking techniques have recently become popular in the long-read era, where indels are frequent. One proposed method is to link two nearby minimizers [18] or several MinHash seeds [25] into a seed. Linking several minimizers into a seed is usually a relatively cheap computation, as the minimizers constitute a subset of the positions on the reference. Such seeding techniques have been used in long-read mapping [25], and in long-read overlap detection for genome assembly [18] and error correction [87]. A downside of these methods is that linking nearby minimizers or MinHash seeds implies that if some of them are destroyed due to mutations in a region, all the seeds in the region will be destroyed. Put another way, nearby seeds share redundant information (in the form of shared minimizers or MinHash seeds). Therefore, alternative approaches such as strobemers [84] (see right panel in Fig. 4) have been described, where the goal has been to reduce the shared information between close-by seeds by linking k-mers at seemingly random positions within a window. Such pseudo-random linking of k-mers implies that, if one seed is destroyed due to a mutation, a nearby seed may still anchor the region. Strobemers have been shown effective at finding anchors between long reads and for long-read mapping [84], and have been used in short-read mapping programs [85], but the pseudo-random linking comes at an increased computational cost compared to simple linking of neighboring k-mers.
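A simplified two-strobe construction in the spirit of [84] can be sketched as follows: the first strobe is the k-mer at each position, and the second strobe is chosen by a minimum-over-a-window rule in a downstream interval, so that the choice varies with the local sequence. Window bounds, the ordering function, and the names are illustrative, not the exact definitions of [84].

    def two_strobe_seeds(seq, k, w_min, w_max, h=lambda x: x):
        """For each position i, link the k-mer at i with the k-mer of minimal
        h-value in the downstream window [i + w_min, i + w_max].  Returns
        (i, j, linked sequence) triples; the linked sequence (or its hash)
        is what would be inserted into the index."""
        n = len(seq) - k + 1                   # number of k-mer start positions
        seeds = []
        for i in range(n):
            lo, hi = i + w_min, min(i + w_max, n - 1)
            if lo > hi:
                break                          # no room left for a second strobe
            j = min(range(lo, hi + 1), key=lambda p: (h(seq[p:p + k]), p))
            seeds.append((i, j, seq[i:i + k] + seq[j:j + k]))
        return seeds

In [84], the ordering of candidate second strobes additionally depends on the first strobe's hash, which further decorrelates the choices made by neighboring seeds.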

Fig. 4

Illustration of strobemers' capacity to handle indels. As in Fig. 3, two sequences are presented. This time, \(s_2\) has an insertion (pink G). In the left panel, minimizers are selected using \(w=2, k=5\). Blue stars point to the selected minimizers in each blue window. One can see that the only safe region in which to generate a minimizer is the CGGTT sequence after the insertion, which is shared and of length \(\ge k\). Put differently, the k-mers in red have no chance of being common to the two sequences. However, in this example, the scheme fails to select a common minimizer in the safe region. Strobemer selection is presented in the right panel, using \(k=2,s=2,w=2\). At each position, the first k-mer is selected to be the start site of the strobemer. Then, in the non-overlapping window (of size w) downstream of the first k-mer, a second k-mer is selected according to one of the selection techniques presented in [84] (we illustrate selecting the lexicographical minimizer). We underline the bases that are kept for each strobemer. For instance in \(s_1\), the first k-mer is CG at positions 0 and 1, then the next window starts at position 2. Two k-mers are computed from this window, AC and CG, and AC is the minimizer. Therefore, the strobemer is (CG,AC). Again, strobemers with no chance of being shared between \(s_1\) and \(s_2\) are colored in red. For strobemers, this is the case when at least one part contains the mutated base. We note not only that the CGGTT region has a common strobemer (CG,GT) in both sequences, but also that the scheme allowed the seed to “jump over” the mutated G and select another common strobemer (GA,CG) in a more difficult region. The strobemers in this example consist of two k-mers (\(s=2\)), but they can be constructed for any \(s>2\)

Two other indel-tolerant fuzzy seeding techniques are BLEND seeds [31] and TensorSketch [50]. The BLEND seeding mechanism applies SimHash [17] (an alternative locality-sensitive hashing to MinHash) to either minimizers or strobemers to construct fuzzy subsampled seeds. The authors showed that read mapping and overlap detection with BLEND seeds implemented in minimap2 [59] could improve mapping speed and, in some cases, accuracy. TensorSketch [50] is based on computing all subsequences of a given length within a given window. The idea is that similar sequences will share many similar subsequences and lie close in the embedding space. TensorSketch has been used in long-read mapping to graphs and offers high sensitivity, but at a significant computational cost compared to approaches using exact seeds [49].

Dynamic seeds

The previously discussed seeds share the characteristic that they can all be produced and inserted in a hash table and, consequently, only require a single lookup. Such techniques are typically fast and, hence, popular in long-read mapping algorithms. However, the downside is that if a seed differs in a region between the reference and the query (e.g., due to an error), there is no way to alter the seeds in this region at mapping time. There are, however, other types of seed constructs, which we here refer to as dynamic seeds, that can be computed on the fly at the mapping step and then used as seeds downstream in the read mapping algorithm.

Maximal exact matches (MEMs) [20] are matches between a query and a reference sequence that cannot be extended in either direction on the query or the reference without destroying the match. They are typically produced by first identifying a k-mer match and then applying an extension process. MEMs are guaranteed to be exact matches between the query and the reference and are bounded below by length k, but they have no upper threshold on seed size. MEMs have been used in earlier long-read mapping programs (e.g., BWA-MEM) [15, 57] and for long-read splice mapping [86], but these seeds are more computationally expensive to compute, and such methods are typically slower than single-query seed-based algorithms.
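As a minimal illustration of the extension-based route (full-text indexes such as suffix arrays are the classical way to enumerate MEMs exhaustively), the sketch below greedily extends a known k-mer match in both directions; the positions and names are ours.

    def extend_kmer_to_mem(query, ref, q_pos, r_pos, k):
        """Given an exact k-mer match query[q_pos:q_pos+k] == ref[r_pos:r_pos+k],
        extend it left and right until a mismatch or a sequence end, yielding a
        maximal exact match as (query start, reference start, length)."""
        left = 0
        while (q_pos - left - 1 >= 0 and r_pos - left - 1 >= 0
               and query[q_pos - left - 1] == ref[r_pos - left - 1]):
            left += 1
        right = 0
        while (q_pos + k + right < len(query) and r_pos + k + right < len(ref)
               and query[q_pos + k + right] == ref[r_pos + k + right]):
            right += 1
        return q_pos - left, r_pos - left, k + left + right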

Minimal confidently alignable substrings (MCASs)

If a query was sampled from a repetitive region in the reference, one is likely to find several clusters of anchoring seeds across the reference. Further dynamic programming operations to decipher the true origin region of the query are typically costly, or even unfeasible if too many copies have to be considered. The query might also be attributed to the wrong copy because of sequencing errors. A recent contribution [47] proposed a solution for seeding in repetitive regions. The procedure finds the smallest substrings that match uniquely (MCASs) between the query and the reference. In theory, there can be as many MCASs as the query length. In practice, the more divergent the repeat copies, the shorter the MCASs, since bases that distinguish a single copy from the others are more likely to be found.

Implementation of the seeding step

Seed transformations before indexing

Originally, minimizers used a lexicographical ordering. However, with our four-base alphabet, this tends to select sequences starting with runs of the alphabetically smallest letters, such as “AAA...”. Random hash functions assigning each k-mer a value between 0 and a maximum integer are therefore preferred [88].

Long-read technologies are known for accumulating errors in homopolymer regions, typically adding or removing a base in a stretch of a single repeated nucleotide. Sequences can be homopolymer-compressed before extracting k-mers: homopolymers longer than a size s are reduced to a single base, then k-mers are computed over the compressed sequence. For instance, for \(s=3,\,k=4\), an original sequence ATTTTGAAAACC is compressed to ATGACC, and the final k-mers are ATGA, TGAC, GACC. This procedure allows finding more anchors while indexing fewer k-mers or minimizers. Homopolymer compression appears in long-read mapper implementations (e.g., [59]).
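This transformation is straightforward to implement; the sketch below reproduces the worked example above (the threshold parameter s is the one defined in the text).

    def homopolymer_compress(seq, s=3):
        """Collapse every run of a single base longer than s into one base."""
        out, i = [], 0
        while i < len(seq):
            j = i
            while j < len(seq) and seq[j] == seq[i]:
                j += 1                      # j - i is the run length
            out.append(seq[i] if j - i > s else seq[i:j])
            i = j
        return "".join(out)

    assert homopolymer_compress("ATTTTGAAAACC", s=3) == "ATGACC"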

In regions of low complexity (e.g., ATATATA, CCCCC), the standard minimizer procedure keeps all minimal k-mers in a window: several k-mers can share the minimal value and all be selected, which tends to over-sample repetitive k-mers. A robust winnowing procedure is proposed in [48], which avoids this over-sampling effect by selecting fewer copies of a k-mer, at the price of increased context dependency.

Hash tables prevail for seed indexing

Indexing of fixed-size seeds is usually done using hash tables (although FM-indexes for k-mers exist [9]). In the context of sketching, invertible hash functions have been a key asset for using minimizers as k-mer representatives. In other words, a hash value is associated with one and only one k-mer, and the k-mer sequence can be retrieved from the hash value (using reciprocal operations). This choice allows a very fast k-mer/minimizer correspondence, but is costly memory-wise, as it implies that the fingerprints of the hash table are not compressed (which is mitigated by the density of the sketching). Minimizers are then used to populate a hash table, which associates them with their position(s) in the reference and their strand information (usually hashed seeds are canonical k-mers: the smallest lexicographic sequence between the original k-mer and its reverse complement). There also exist learned index data structures [51] that further accelerate the querying of minimizers.
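To make the invertibility concrete, the sketch below 2-bit encodes a k-mer and scrambles it with a single odd multiplication modulo \(2^{64}\), which has a modular inverse; production mappers use stronger invertible integer mixes, and the constant chosen here is arbitrary.

    MASK = (1 << 64) - 1
    MULT = 0x9E3779B97F4A7C15           # any odd constant is invertible mod 2**64
    MULT_INV = pow(MULT, -1, 1 << 64)   # modular inverse used to reverse the hash
    ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
    DEC = "ACGT"

    def kmer_to_int(kmer):              # valid for k <= 32 (2 bits per base)
        v = 0
        for c in kmer:
            v = (v << 2) | ENC[c]
        return v

    def int_to_kmer(v, k):
        return "".join(DEC[(v >> (2 * (k - 1 - i))) & 3] for i in range(k))

    def hash_kmer(kmer):
        return (kmer_to_int(kmer) * MULT) & MASK

    def unhash_kmer(h, k):
        return int_to_kmer((h * MULT_INV) & MASK, k)

    assert unhash_kmer(hash_kmer("ACGGTAC"), 7) == "ACGGTAC"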

Variable-length seeds are indexed in full-text data structures (e.g., suffix arrays or the FM-index [30]), which allow finding and counting arbitrarily long queries in the reference. They were used in the first versions of long-read mappers. However, variable-length seeds take longer to query in these data structures, whereas hashed seeds are queried in constant time. Since minimizers represent fixed-length k-mers, hash table solutions mainly prevail.

Selecting seeds to query

In [59], it is proposed to index all minimizers from the reference during the indexing phase (although the latest versions include the weighted k-mers and robust winnowing heuristics), and to instead skip highly repetitive k-mers when finding anchors (also called soft masking). The authors noticed that when a query is sampled from a repetitive region, such a procedure prevents it from being seeded. Techniques that use longer fuzzy seeds (e.g., strobemers) [31] reduce the number of masked regions, although this comes at a cost in sensitivity. Another approach [82] computes a new set of minimizers on the targeted reference regions in order to obtain finer candidate chains, which helps alignment confidence, particularly in repetitive or low-complexity regions.

Chaining is dominated by dynamic programming with concave gap score functions

A dynamic programming problem

Once the reference’s seeds are indexed, a set of seeds is extracted from the query and looked up in the index to find anchors. The anchors’ positions on the query and the reference are stored, as well as the forward/reverse information. Instead of directly extending the alignment between anchors, as is done in short-read mapping, a chaining step is added, which is meant to accelerate further extensions. Chaining acts as a filter and a guide for the smaller extensions that then need to be realized only between selected anchor pairs. Without it, too many extension procedures, most of which would be dead ends, would have to be started.

In an ideal case, there is a unique way of ordering anchors by ascending Cartesian positions in the (reference, query) space that passes through all the anchors. In practice, some anchors are spurious, others correspond to repeated regions and yield different possible chains. Moreover, more parameters must be taken into account. Thus, methods optimize different aspects (also illustrated in Fig. 5):

A1) Do not allow anchors which are not ascending either by the anchors’ start or end coordinates in both the query and reference (see first case in Fig. 5).

A2) Avoid discrepancies in diagonals between anchors (second case in Fig. 5).

A3) Do not allow large spaces between consecutive anchors of the chain (see third case in Fig. 5).

A4) Favor the longest possible anchor chain (fourth case in Fig. 5).

A5) If inexact matches in seeds are possible, each seed represents a pair of intervals on the target and the query. Find a subsequence of seeds that minimizes the sum of the Levenshtein distances computed on these pairs of intervals (roughly, ensure that the matched regions on the target and query are as similar as possible).

The problem of finding an optimal chain using non-overlapping anchors has been called the local chaining problem [1], although in this application anchors can overlap. The score \(f(i+1)\) represents the cost of appending an anchor \(a_{i+1}\) to a chain currently ending with anchor \(a_i\). This score is often called the gap score in the literature, though it includes other constraints, as described above. The chaining problem for long reads seeks to find an optimal colinear chain with a positive gap score.
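A toy quadratic dynamic program along these lines is sketched below; the scoring terms are simplified stand-ins for constraints A1 to A4, and the anchor lengths, penalties, and thresholds are arbitrary choices rather than the scoring of any particular mapper.

    def chain_anchors(anchors, k, gap_weight=0.1, max_gap=5000):
        """Colinear chaining over anchors given as (query_pos, ref_pos) pairs for
        seed matches of length k.  f[j] is the best chain score ending at anchor j."""
        if not anchors:
            return [], 0
        anchors = sorted(anchors)                  # by query, then reference position
        f = [float(k)] * len(anchors)              # a chain may start at any anchor
        back = [-1] * len(anchors)
        for j, (qj, rj) in enumerate(anchors):
            for i, (qi, ri) in enumerate(anchors[:j]):
                dq, dr = qj - qi, rj - ri
                if dq <= 0 or dr <= 0 or max(dq, dr) > max_gap:
                    continue                       # violates A1 or A3
                gap = gap_weight * abs(dq - dr)    # A2: penalize diagonal shifts
                score = f[i] + min(dq, dr, k) - gap
                if score > f[j]:
                    f[j], back[j] = score, i
        # Backtrack from the best-scoring anchor; favoring high scores favors
        # long, consistent chains (A4).
        j = max(range(len(anchors)), key=lambda x: f[x])
        best = f[j]
        chain = []
        while j != -1:
            chain.append(anchors[j])
            j = back[j]
        return chain[::-1], best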

Fig. 5

An illustration of the different constraints taken into account in the gap score functions. The reference axis shows a genome region of interest where anchors were found, not the whole reference. A1–A4 correspond to the items in the text of the “A dynamic programming problem” section. Anchors are shown in blue. The chain selected with respect to the described constraints is highlighted in yellow, and a line approximately passing through its anchors is shown in red. The longest chain is covered by a green line when it was not the selected one

Methods mainly follow one of two strategies: a two-step approach, which (1) finds rough clusters of seeds as putative chains and then (2) finds the best-scoring chain among the selected clusters; or a single pass that applies a custom dynamic programming solution to find the best anchor chain. We can start by noting that one of the first mappers dedicated to long reads solved a global chaining problem to determine a chain of maximum score, by fixing starting and ending points (anchors) such that their interval is roughly the size of the query [15]. Such an approach would easily discard long gaps and spaces in alignments.

Chaining in two steps

Clusters of seeds are found through single-linkage in 2D space

The two-step approaches rely on a first clustering step. Although it tends to be replaced by single-step chaining (see
