Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries

The following sections provide the details referred to in the “Results” section. We will use the following notation:

\(k\): k-mer size

\(w\): window size for minimizers

\(h\): number of hash functions

\(m\): size (in bits) of an individual Bloom filter

\(n\): number of values (k-mers) stored in an individual Bloom filter

\(p_{\max }\): false-positive rate

\(b\): number of (user) bins indexed in an (H)IBF

\(t\): number of (technical) bins of the IBF data structure

\(t_{\max }\): maximum number of technical bins in each IBF of the HIBF

\(c(w,k)\): expected decrease in the amount of representative k-mers when switching from (k, k)-minimizers to (w, k)-minimizers

\(x\): number of minimizers in a query

\(a\): number of false-positive answers

\(b_p(x,a)\): probability of returning \(a\) false-positive answers when querying \(x\) minimizers in a Bloom filter with false-positive rate \(p\)

\(t(x)\): threshold for a given \(x\)

\(c\): (constant) correction that is added to the threshold

\(p_{\mathrm{corr}}\): threshold for the correction term, i.e., increase \(c\) until \(b_p(x, a+c) < p_{\mathrm{corr}}\)

Computing an HIBF layout—a DP algorithm

In this section, we explain how to compute a layout, i.e., how we decide which user bins to split and which to merge (Fig. 3). To this end, we engineered a dynamic programming (DP) algorithm that heuristically optimizes the space consumption of a single IBF, while considering the space needed for lower level IBFs for merged bins. Thus, computing the layout on the first, main level IBF optimizes its layout while estimating the size of the entire HIBF. As outlined in “The HIBF” section, we then recursively apply the DP algorithm on each lower level IBF.

Assume that we have b user bins \(\text{UB}_0,\ldots ,\text{UB}_{b-1}\), sorted in decreasing order of their size, \(|\text{UB}_0| \ge \ldots \ge |\text{UB}_{b-1}|\), which we want to distribute across t technical bins \(\text{TB}_0,\ldots , \text{TB}_{t-1}\) in a single IBF. While not strictly necessary, the order is chosen such that small user bins cluster next to each other, because in our algorithm only contiguous bins may be merged. We denote the union size estimate of a merged bin that stores the k-mer multiset \(\text{UB}_i \cup \text{UB}_{i+1} \cup \ldots \cup \text{UB}_j\) as \(U_{i,j}\). We use HyperLogLog sketches [22] (see the “Union estimation” section) to quickly estimate the size of the union when merging user bins. Since the IBF only stores a representative k-mer’s presence (k-mer content), not how often it was inserted, the size of the merged bin may be smaller than the sum of sizes of the respective user bins. The effect is that merging user bins of similar sequence content results in smaller merged bins, which is beneficial for the overall size of the IBF. We exploit this advantage by the optional step of rearranging the user bins based on their sequence similarity (see the “Rearrangement of user bins” section).

Further, the user needs to fix the number of hash functions h, the desired false-positive rate \(p_{\max }\), and the maximum number \(t_{\max }\) of technical bins to be used for each IBF of the HIBF. The former two parameters are needed to estimate the IBF size and must correspond to the parameters used when building the index afterwards. The latter is critical for the HIBF layout. See the “Recommended choice of parameters” section for a discussion of sensible defaults.

Regarding the expected false-positive rate of the HIBF index, we point out that when splitting a user bin, we introduce a multiple testing problem. This happens because we query a k-mer for a split bin several times in several technical bins. We correct for this by increasing the size of the respective technical bins by a factor \(f_h(s, p_{\max })\), where s is the number of technical bins the user bin is split into and \(p_{\max }\) is the desired false-positive rate (see the “IBF size correction for split bins” section).

The general sketch of the algorithm is the following: The dynamic programming (DP) matrix M has t rows, representing \(\text{TB}_0,\ldots , \text{TB}_{t-1}\), and b columns, representing \(\text{UB}_0,\ldots , \text{UB}_{b-1}\). When we move horizontally in the matrix, we consume multiple user bins while remaining in a single technical bin. This indicates a merged bin. When we move vertically, we consume multiple technical bins while remaining in a single user bin. This indicates a split bin. We treat a single bin, introduced for clarity in the “Methods” section, as a split bin of size 1. We do not move along the diagonal. The above semantics allow us to verbalize a structure that is then used to estimate the space consumption and compute a local optimum in each cell of the DP matrix. Specifically, the space consumption is tracked using two \(t \times b\) matrices \(M\) and \(L\):

\(M_{i,j}\) tracks the maximum technical bin size \(\max_{g \in \{0,\ldots,i\}}(|\text{TB}_g|)\) when the first \(j + 1\) user bins are distributed to \(i + 1\) technical bins. Minimizing the maximum technical bin size optimizes the IBF space consumption, since it directly correlates with the total IBF size (Fig. 2).

\(L_{i,j}\) tracks the space consumption estimate of all lower level IBFs that need to be created for each merged bin given the structure imposed by \(M_{i,j}\).

Initialization

For the first column, when \(j = 0\), there is only a single user bin \(\text{UB}_0\), which can be split into up to t technical bins. Therefore, the maximum technical bin size stored in \(M_{i,0}\) is the size of \(\text{UB}_0\) divided by the number \(i+1\) of technical bins it is split into, corrected by \(f_h\) (see the “IBF size correction for split bins” section). Since no merging is done, no lower level IBFs are needed and \(L_{i,0}\) is always 0.

$$\forall i \in \{0,\ldots,t-1\}: \quad M_{i,0}=\frac{|\text{UB}_0|}{i+1}\cdot f_h(i+1,p_{\max })$$

(1)

$$\forall i \in \{0,\ldots,t-1\}: \quad L_{i,0}=0$$

(2)

For the first row, when \(i = 0\) and \(j\ne 0\), all user bins have to be merged into a single technical bin. The size of the resulting merged bin is estimated using the precomputed unions \(U_{0,j}\) (see the “Union estimation” section). Since we create a merged bin in each step, the resulting additional space consumption on lower level IBFs is estimated by the sum of sizes of the contained user bins times the maximal number of levels \(l=\lceil \log_{t_{\max }}(j+1)\rceil\). We use the sum instead of the union here because we do not yet know how the lower level IBFs are laid out, so we estimate the worst case of storing every k-mer again on each lower level.

$$\forall j \in \{1,\ldots,b-1\}: \quad M_{0,j}=U_{0,j}$$

(3)

$$\forall j \in \{1,\ldots,b-1\}: \quad L_{0,j}= l \cdot \sum_{g=0}^{j} |\text{UB}_g|$$

(4)

Fig. 7

DP algorithm traceback visualization. Matrices a, b, and c visualize the algorithm for a layout distributing \(b=7\) user bins (columns) to \(t=5\) technical bins (rows). a Shows the traceback of the initialization, all coming from the virtual starting point \((-1,-1)\). E.g., \(M_{4,0}\) represents the sub-layout of splitting \(\text{UB}_0\) into all available technical bins. b Shows which cells \(j' < j - 1 = 3\) are considered in the horizontal recursion. E.g., \(j' = 2\) would indicate to merge \(\text{UB}_3\) and \(\text{UB}_4\) into \(\text{TB}_3\). c Shows which cells \(i' < i = 3\) are considered in the vertical recursion. E.g., \(i' = 1\) would indicate to split \(\text{UB}_4\) into \(\text{TB}_2\) and \(\text{TB}_3\)

Recursion

We store the maximum technical bin size in \(M_{i,j}\) and the lower level costs in \(L_{i,j}\), and we want to optimize the total HIBF space consumption. It is computed by \(M_{i,j}\) times the number of technical bins, which is the size of the first, main level IBF, plus the lower level costs \(L_{i,j}\):

$$|HIBF| = M_{i,j} \cdot (i + 1) + \alpha \cdot L_{i,j}$$

(5)

The parameter \(\alpha\) can be used to tweak the influence of lower levels on the space and query time of the HIBF. The query time increases when many lower levels are introduced, since we have to traverse and query many IBFs, but the space consumption is often lower. When \(\alpha = 1\), the space consumption is optimized by expecting our lower level space estimate to be exact. If we choose higher values for \(\alpha\), we artificially increase the costs of merged bins and their lower levels, thereby decreasing their occurrence in the layout. A sensible default that was experimentally derived to work well is \(\alpha = 1.2\).

In the recursion, we process each cell \(M_{i,j}\), with \(i \ne 0\) and \(j\ne 0\), by deciding for the next user bin \(\text{UB}_j\) whether to (1) split it into \(i - i'\) technical bins, for some \(i' < i\), or (2) merge it with all user bins starting at \(\text{UB}_{j'+1}\), for some \(j' < j - 1\).

In the first case, when splitting \(\text{UB}_j\) and thereby moving vertically, we want to find the \(i' < i\) that results in the smallest overall HIBF size \(v_{i,j}\) (Fig. 7c). Semantically, we originate from the layout \(M_{i',j-1}\), which already distributed \(\text{UB}_0,\ldots ,\text{UB}_{j-1}\) into \(\text{TB}_0,\ldots ,\text{TB}_{i'}\), leaving the technical bins \(\text{TB}_{i'+1},\ldots ,\text{TB}_{i}\) to store the split content of \(\text{UB}_j\). The new HIBF size is computed analogously to Eq. 5 with the new maximal technical bin size \(m_v\) and the new lower level costs \(l_v\). Since we introduce no merged bin, \(l_v\) is simply the former costs \(L_{i',j-1}\). \(m_v\) is computed by taking the maximum of the former value \(M_{i',j-1}\) and the split bin size we obtain from equally dividing the k-mer content of \(\text{UB}_j\) into \(i - i'\) technical bins, corrected by \(f_h\).

$$v_{i,j}=\min_{i' < i}\left( \underbrace{\max \left( M_{i',j-1}, \frac{|\text{UB}_j|\cdot f_h(i-i', p_{\max })}{i-i'}\right) }_{m_v} \cdot (i+1)+\alpha \cdot \underbrace{L_{i',j-1}}_{l_v}\right)$$

(6)

In the second case, when merging \(\text{UB}_j\) and thereby moving horizontally, we want to find the \(j' < j - 1\) that results in the smallest overall HIBF size \(h_{i,j}\) (Fig. 7b). Semantically, we originate from the layout \(M_{i-1,j'}\), which already distributed \(\text{UB}_0,\ldots ,\text{UB}_{j'}\) into \(\text{TB}_0,\ldots ,\text{TB}_{i-1}\), leaving technical bin \(\text{TB}_i\) to store the merged content of \(\text{UB}_{j'+1},\ldots ,\text{UB}_j\). The new HIBF size is computed analogously to Eq. 5 with the new maximal technical bin size \(m_h\) and the new lower level costs \(l_h\). \(m_h\) is computed by taking the maximum of the former value \(M_{i-1,j'}\) and the merged bin size we obtain from the precomputed union \(U_{j'+1,j}\). Since we introduce a new merged bin, \(l_h\) is computed by adding to the former costs \(L_{i-1,j'}\) the space estimate of \(l\) expected lower levels times the sum of the contained user bin sizes (see Eq. 4).

$$h_{i,j}=\min_{j' < j-1}\left( \underbrace{\max \left( M_{i-1,j'}, U_{j'+1,j} \right) }_{m_h} \cdot (i+1)+\alpha \cdot \underbrace{\left( L_{i-1,j'}+l\cdot \sum_{g=j'+1}^{j} |\text{UB}_g|\right) }_{l_h}\right)$$

(7)

Given the \(m_v\), \(l_v\), \(m_h\), and \(l_h\) for which the minima in \(v_{i,j}\) and \(h_{i,j}\) are achieved, respectively, the values of the cells \(M_{i,j}\) and \(L_{i,j}\) are updated according to the minimum of \(v_{i,j}\) and \(h_{i,j}\):

$$M_{i,j}=\begin{cases} m_v & \text{if } v_{i,j} \le h_{i,j} \\ m_h & \text{otherwise} \end{cases} \qquad L_{i,j}=\begin{cases} l_v & \text{if } v_{i,j} \le h_{i,j} \\ l_h & \text{otherwise} \end{cases}$$

(8)

The above recurrence can be used to fill the two dynamic programming matrices row- or column-wise because the value of a cell depends entirely on cells with strictly smaller indices. The traceback can then be started from the cell \(M_{t-1,b-1}\).
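To make the recurrence concrete, the following is a minimal Python sketch of the layout DP (illustrative only; all names are ours, not the tool's API). It assumes a helper union(j1, j2) returning the HyperLogLog estimate \(U_{j_1,j_2}\) and a helper f_h(s) returning the split-bin correction \(f_h(s, p_{\max })\); traceback bookkeeping is omitted.

```python
import math

def hibf_layout_cost(ub_sizes, t, t_max, alpha, union, f_h):
    """Fill the DP matrices M and L (Eqs. 1-8) and return the estimated HIBF size.

    ub_sizes : user bin sizes, sorted in decreasing order
    union    : union(j1, j2) -> HyperLogLog estimate of U_{j1,j2}
    f_h      : f_h(s) -> split-bin size correction for s technical bins
    """
    b = len(ub_sizes)
    levels = lambda n: max(1, math.ceil(math.log(n, t_max)))  # l = ceil(log_{t_max}(n))
    M = [[0.0] * b for _ in range(t)]  # maximum technical bin size
    L = [[0.0] * b for _ in range(t)]  # lower-level space estimate

    for i in range(t):                 # first column (Eqs. 1 and 2)
        M[i][0] = ub_sizes[0] / (i + 1) * f_h(i + 1)
    for j in range(1, b):              # first row (Eqs. 3 and 4)
        M[0][j] = union(0, j)
        L[0][j] = levels(j + 1) * sum(ub_sizes[: j + 1])

    for i in range(1, t):
        for j in range(1, b):
            cost = lambda ml: ml[0] * (i + 1) + alpha * ml[1]   # Eq. 5 for this cell
            # vertical case (Eq. 6): split UB_j into i - i' technical bins
            v_best = min(
                ((max(M[ip][j - 1], ub_sizes[j] * f_h(i - ip) / (i - ip)), L[ip][j - 1])
                 for ip in range(i)), key=cost)
            # horizontal case (Eq. 7): merge UB_{j'+1}..UB_j into TB_i
            h_best = min(
                ((max(M[i - 1][jp], union(jp + 1, j)),
                  L[i - 1][jp] + levels(j - jp) * sum(ub_sizes[jp + 1: j + 1]))
                 for jp in range(j - 1)), key=cost) if j > 1 else (math.inf, math.inf)
            # Eq. 8: keep whichever case gives the smaller estimated HIBF size
            M[i][j], L[i][j] = v_best if cost(v_best) <= cost(h_best) else h_best
    return M[t - 1][b - 1] * t + alpha * L[t - 1][b - 1]
```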

Since we use HyperLogLog sketches, the algorithm runs quite fast in practice. For example, it takes only 13 min to compute the layout with \(t_{\max }=192\) for a data set consisting of all complete genomes of archaea and bacteria species in the NCBI RefSeq database [17] (about 25,000 genomes with a total size of roughly 100 GiB).

HyperLogLog estimates

The HyperLogLog (HLL) algorithm [22] approximates the number of distinct elements in the input data, which are first converted into a set of uniformly distributed, fixed-size hash values. It is based on the observation that a hash value of length \(q>p\) has a probability of \(\frac{1}{2^p}\) of containing \(p\) leading zeros. When keeping track of the maximum number of leading zeros \(lz\) of the data’s hash values, one can estimate its size via \(2^{lz}\). Since this estimate has a very high variance, it is improved by splitting the data into \(m = 2^b\), \(b \in [4, q-1]\), subsets and keeping track of \(lz\) for each subset. These \(m\) values are called a sketch of the input data. The size estimate of the original input is computed using the harmonic mean and a bias correction of the \(m\) values of the subsets (see [22] for details).

In our application, we convert the input sequences into 64-bit k-mers by arithmetic coding, hash them using the third-party function XXH3_64bits, and then apply the HLL algorithm with \(q=64\) and \(b=12\) using our adjusted implementation of the HyperLogLog library.
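For illustration, a condensed Python version of such a sketch is given below. It is our own simplified reimplementation, using the raw harmonic-mean estimator with the large-m bias constant from [22] and omitting the small/large-range corrections; the production code uses the adjusted HyperLogLog library mentioned above.

```python
# Simplified HyperLogLog sketch (illustration only).
class HllSketch:
    def __init__(self, b=12):
        self.b = b                       # 2**b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, hash64):
        idx = hash64 >> (64 - self.b)                  # first b bits select the register
        rest = hash64 & ((1 << (64 - self.b)) - 1)     # remaining 64-b bits
        rank = (64 - self.b) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for large m
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / harmonic
```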

Union estimation

Given two HyperLogLog sketches with the same number of values m, e.g., created from two user bins, they can be merged by inspecting the m values of both sketches and simply keeping the larger value (lz) of each. The resulting sketch estimates the size of the combined data, namely the number of distinct elements in the union of the two user bins. When merging several sketches, we merge the first two and then iteratively merge the remaining sketches into the new union sketch.
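A possible sketch of this merge, reusing the HllSketch class sketched above:

```python
# Union sketch by register-wise maximum; its estimate approximates |A ∪ B|.
def merge_sketches(a, b):
    assert a.m == b.m, "sketches must have the same number of registers"
    union = HllSketch(a.b)
    union.registers = [max(x, y) for x, y in zip(a.registers, b.registers)]
    return union
```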

In our application, when iterating column-wise over the matrix of the DP algorithm, we precompute for each column \(j\) the union estimates \(U_{j',j}\) for \(j' \in [0, \ldots, j-1]\) with

$$U_{j',j} = \left|\text{UB}_{j'} \cup \text{UB}_{j'+1} \cup \ldots \cup \text{UB}_j\right|.$$

(9)

Rearrangement of user bins

When merging two or more user bins, we unite their k-mer content by storing all unique k-mers present in each user bin. It follows that the size of the union is always smaller than or equal to the sum of sizes of the united user bins. For a merged bin in the HIBF, this means that we can potentially save space by uniting user bins of similar k-mer content. To exploit this advantage, the user has the option to partially rearrange the sorted order of user bins based on their similarity.

We estimate the similarity of two user bins by computing the Jaccard distance based on HyperLogLog sketches. The Jaccard distance \(d_J(A,B)\) is defined as 1 minus the Jaccard index J(A, B) and can be rewritten in the following way:

$$d_J(A,B)=1-J(A,B)=2-\frac{|A|+|B|}{|A\cup B|}$$

(10)

We use the rightmost term to compute the Jaccard distance using the union sketch of A and B to estimate \(|A\cup B|\).
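As a small illustration (assuming the HllSketch and merge_sketches helpers sketched in the previous sections), the distance can be estimated directly from two sketches:

```python
# Jaccard distance from HLL sketches via Eq. 10.
def jaccard_distance(sketch_a, sketch_b):
    size_a = sketch_a.estimate()                                # |A|
    size_b = sketch_b.estimate()                                # |B|
    size_union = merge_sketches(sketch_a, sketch_b).estimate()  # |A ∪ B|
    return 2.0 - (size_a + size_b) / size_union
```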

Fig. 8

Rearranging user bins by agglomerative clustering. Given eight user bins (UBs) as input, sorted by their size (left-hand side), the example shows how the order may change after applying an agglomerative clustering algorithm based on sequence similarity (right-hand side). The agglomerative clustering arranged UB-2 and UB-4 next to each other, as well as UB-5 and UB-7

When rearranging, we still want to partly maintain the initial order of user bin sizes for two reasons: (1) for the HIBF it is beneficial to merge small user bins (see Fig. 2), and (2) a limitation of the DP algorithm is that only user bins next to each other can be merged, so ordering them by size clusters small bins together. We therefore only rearrange user bins in non-overlapping intervals. The size of an interval is determined by the parameter \(r_m\), the maximum ratio of the largest to the smallest user bin within the interval.

Within an interval, we perform an agglomerative clustering on the range of user bins based on their Jaccard distances. The order of user bins is changed according to the resulting cluster tree, leaving similar user bins adjacent to each other (see Fig. 8).
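The snippet below illustrates the idea of such a rearrangement within one interval using SciPy's agglomerative clustering; the actual tool ships its own clustering implementation, so this is only a sketch of the principle.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def rearranged_order(sketches):
    """Return a permutation of the user bins that places similar bins next to each other."""
    n = len(sketches)
    # condensed pairwise Jaccard distance matrix (upper triangle, row by row)
    dists = np.array([jaccard_distance(sketches[i], sketches[j])
                      for i in range(n) for j in range(i + 1, n)])
    tree = linkage(dists, method="average")   # agglomerative clustering
    return list(leaves_list(tree))            # leaf order of the cluster tree
```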

IBF size correction for split bins

Recall that for a given false-positive rate \(p_{\max }\), the size in bits m of an individual Bloom filter is proportional to the number of k-mers n it is designed to store. If h is the number of hash functions, then it holds:

$$m \approx -\frac{h \cdot n}{\ln \left( 1-p_{\max }^{\frac{1}{h}}\right) }$$

(11)

If we split a user bin into s technical bins and divide the k-mer content equally among them, we cannot simply use Eq. 11 on each technical bin to determine its size, because we introduce a multiple testing problem. Specifically, the probability of obtaining a false-positive answer for the user bin when querying those s technical bins is \(1-(1-p_{\max })^{s}\), since we only get no false-positive answer if none of the split bins yields a false-positive answer. We can determine the required false-positive rate \(p_{\mathrm{split}}\) such that conducting multiple testing yields the desired false-positive rate \(p_{\max }\):

$$p_{\mathrm{split}} = 1 - (1 - p_{\max })^{\frac{1}{s}}$$

(12)

We use \(p_{\mathrm{split}}\) to determine a factor for increasing the size of split bins to mitigate the effect of multiple testing:

$$f_h(s,p_{\max })=\frac{\ln \left( 1-p_{\max }^{\frac{1}{h}}\right) }{\ln \left( 1-p_{\mathrm{split}}^{\frac{1}{h}}\right) }$$

(13)

More details can be found in Additional file 1. For example, given \(h=4\) and \(p_{\max }=0.01\), if we split a user bin into \(s=5\) technical bins, the size computed with Eq. 11 must be increased by the factor 1.598. If we split the same user bin into \(s=20\) technical bins, the size must be corrected by the factor 2.344.
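These numbers can be reproduced with a few lines of Python implementing Eqs. 11-13 (a direct transcription of the formulas, not the tool's code):

```python
import math

def bloom_filter_bits(n, h, p):
    """Eq. 11: bits needed for n k-mers at false-positive rate p with h hash functions."""
    return -n * h / math.log(1.0 - p ** (1.0 / h))

def split_correction(s, h, p_max):
    """Eqs. 12 and 13: size correction factor for a user bin split into s technical bins."""
    p_split = 1.0 - (1.0 - p_max) ** (1.0 / s)
    return math.log(1.0 - p_max ** (1.0 / h)) / math.log(1.0 - p_split ** (1.0 / h))

print(round(split_correction(5, 4, 0.01), 3))   # 1.598
print(round(split_correction(20, 4, 0.01), 3))  # 2.344
```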

Building an HIBF index from a layout

The hierarchy in an HIBF layout forms a tree (Fig. 3). To avoid reading input files several times, we build the index using a bottom-up strategy implemented by a recursive construction algorithm. The recursion is anchored at IBFs without merged bins (the leaves of the tree), and the layout is traversed similarly to a breadth-first search. At a leaf, we read the respective input samples and transform them into their representative k-mer content. Based on the k-mer contents, we can construct an IBF whose size fits the desired maximum false-positive rate. We then insert the k-mer contents into the IBF. Moving along the tree, we keep the k-mer contents of child nodes to insert them into the respective merged bins of parent IBFs. At a parent node, the input samples from split and single bins are read, processed and, together with the merged bin content from child nodes, used to construct the current IBF. The algorithm ends at the root IBF on level L1. We trivially parallelized the construction of independent lower levels.

Querying an HIBF

To query the HIBF for a sequence, we employ a top-down strategy. First, the query sequence is transformed into representative k-mers in the same way as was done for the index construction. We then determine the membership of these k-mers in the first, main level IBF of the HIBF by counting, for each technical bin, how many of the query’s k-mers occur in it. Subsequently, we gather all technical bins whose count is equal to or exceeds a certain threshold. For split bins, the k-mer counts of all bins associated with the respective user bin need to be accumulated before thresholding. For split and single bins, we can directly obtain the corresponding user bins to answer the query. For merged bins that exceed the threshold, we need to apply the same procedure recursively on the associated child IBF on a lower level. Notably, having a threshold allows us to skip traversal of lower level IBFs whose upper level merged bins do not exceed the threshold. In practice, this means that we only have to access a fraction of the HIBF. The final result is the set of all user bins that contain a query.
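A simplified, hypothetical sketch of this recursion is shown below; the method and attribute names (bulk_count, bins, technical_bins, is_merged, child_ibf, user_bin) are illustrative stand-ins, not the actual API.

```python
def query_hibf(ibf, kmers, threshold, hits=None):
    hits = set() if hits is None else hits
    counts = ibf.bulk_count(kmers)                            # per-technical-bin k-mer counts
    for entry in ibf.bins():                                  # one entry per user/merged bin
        total = sum(counts[tb] for tb in entry.technical_bins)  # accumulate counts of split bins
        if total < threshold:
            continue                                          # below threshold: skip whole subtree
        if entry.is_merged:
            query_hibf(entry.child_ibf, kmers, threshold, hits)  # descend one level
        else:
            hits.add(entry.user_bin)
    return hits
```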

The threshold can be either user-defined or computed for a given number of allowed errors e in the query. For the latter, we rely on the method introduced in [12]. Briefly, we utilize a well-known k-mer lemma [23]. However, when winnowing minimizers are used, we apply a probabilistic model that accounts for how errors might destroy relevant k-mers [12]. The model returns a threshold value given how many representative k-mers are in the query. In this work, we further refined the model to incorporate the effect of the false-positive rate on the threshold (see the “Impact of the HIBF’s false-positive rate” section). The thresholding step is a convenience for the user that is missing in tools like Mantis [5].

Key parameters

The HIBF has many parameters that influence the memory consumption and run time and partially depend on each other. We acknowledge that this can be challenging for the user. To alleviate the problem, we will in the following sections (1) describe the impact of the key parameters on the performance of the HIBF and (2) give sensible defaults and describe how the HIBF can set them automatically given very few, easy-to-understand parameters.

The choice of t max

The choice of \(t_{\max }\), i.e., the maximum number of technical bins of the individual IBFs in an HIBF, influences the depth of the HIBF, which impacts both the total index size and the query runtime. We show in the following that the optimal \(t_{\max }\) is near the square root of the number of user bins b, i.e., \(sq=\lceil \sqrt{b}/64\rceil \cdot 64\).
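For example, rounding the square root of b up to the next multiple of the 64-bit word size can be computed as follows; assuming roughly 25,000 user bins (one per genome of the RefSeq data set used below), this yields 192:

```python
import math

def sq(b, word_size=64):
    """Smallest multiple of the word size that is >= sqrt(b)."""
    return math.ceil(math.sqrt(b) / word_size) * word_size

print(sq(25_000))  # 192
```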

If \(t_{\max }\) is relatively small, the depth of the HIBF is large, but the individual IBFs will be small. In terms of query runtime, querying a small IBF is fast, but we potentially have to traverse many levels. In terms of space, a small \(t_{\max }\) results in storing a lot of redundant information and thus the space consumption increases. If we choose \(t_{\max }\) relatively large, the depth of the HIBF and the number of IBFs will be low, but the individual IBF sizes will be large. In terms of query runtime, we then expect few queries on each level, but querying an IBF is more costly. In terms of space, the larger \(t_{\max }\), the more technical bins are available to cleverly distribute the content of the user bins to, which lowers the space consumption. For some values of \(t_{\max }\), the space consumption increases, although \(t_{\max }\) is large. This happens because IBFs on the last level contain only a few UBs, which leads to relatively high correction factors \(f_h\) (see the “IBF size correction for split bins” section).

Table 1 Runtime penalty for an increasing number of bins in an IBF. The values represent ratios by which the query run time increases when using an IBF with b bins compared to an IBF with 64 bins. We simulated b equally sized (user) bins of random sequence content (1 GiB in total) for increasing values of b and sampled 1 million reads of length 250. We then constructed an IBF over these b bins using 32-mers and 2 hash functions. We conducted this experiment for different false-positive rates, including 0.0001, 0.0125, and 0.3125, which resulted in IBFs of different densities. Next, we measured the time required to count the k-mer occurrences in each of the b bins for all 1 million reads. The reported values are the average of five repetitions

With a simple experiment, we validated one of the above observations, namely that the query runtime of a single IBF increases with the number of (technical) bins (see Table 1). The results show an increase in query runtime for each false-positive rate. Additionally, we observe that the higher the false-positive rate, the greater the query runtime penalty with an increasing number of bins. We can use those results to optimize the choice of \(t_{\max }\).

Fig. 9

Expected cost versus real cost. The data are all complete genomes of archaea and bacteria from the RefSeq database. The relative expected cost was computed with our model, while the relative real cost was measured with a \(t_{\max }\)-HIBF with \(p_{\max }=0.015\), 4 hash functions, and (24, 20)-minimizers. Both costs are given relative to the respective cost of a 64-HIBF because absolute measurements are platform-dependent. The correlation is \(> 0.9\), although we systematically overestimate the required time. Nevertheless, the three best \(t_{\max }\) are in both cases 192, 256, and 512, with predicted values 0.53, 0.56, and 0.75 and real values 0.39, 0.42, and 0.40

To investigate advantageous values of \(t_{\max }\) for a given data set, the user of our tool has the possibility to compute statistics on their input data. They have to fix the following three parameters beforehand: (1) the k-mer size, (2) the number of hash functions, and (3) the maximum false-positive rate (see the “Recommended choice of parameters” section). The statistics then give an estimation of the expected cost in index size and query time of an HIBF for several values of \(t_{\max }\). As an example, we computed the statistics for the real-world data of the RefSeq database from our experiments and validated them by actually building and querying a respective HIBF (Fig. 9). The results show an excellent correlation between our prediction and the actual product of run time and memory consumption, namely a correlation of 0.95 for \(t_{\max }\) ranging from 64 to 8192. We were able to estimate the index size very closely to the real size, while we slightly underestimated the query time for increasing values of \(t_{\max }\). In the combination of space and query time, we identified \(t_{\max } = 192\) as the optimal choice in both the expected total cost and the real total cost, where 192 is a multiple of the word size 64 that is near the square root of the number of user bins b, i.e., \(sq=\lceil \sqrt{b}/64\rceil \cdot 64\). In general, our tool computes the statistics for \(t_{\max }=sq,64,128,256,\ldots\) until the product of expected space consumption and query time increases. For the above example (Fig. 9), we would stop at \(t_{\max }=256\), since the product increases there, and choose \(t_{\max }=192\) as the optimal value.

In summary, given input data, we can compute a \(t_{\max }\) that minimizes the expected query time of the HIBF, or the expected run time weighted with memory consumption, and free the user from choosing the parameter as long as the data in the UBs is not similar. We postulate that this strategy works well if a query is on average only contained in one (or a few) UBs.

Choice of w and k

To lower the space consumption and accelerate queries, we support (w, k)-minimizers. Other methods for shrinking the set of representative k-mers, such as those described in [24] or [25, 26], are also possible for the HIBF. For a detailed description of (w, k)-minimizers, see [12]. The authors showed that, for up to 2 errors, the parameters \(w=23\) and \(k=19\) perform well. In general, the larger we make w compared to k, the fewer representative k-mers we have to store. However, the accuracy decreases, and we might increase the number of false positives and false negatives. In comparison, Kraken2 [27] uses values of \(w=35\) and \(k=31\).

A k-mer is identified as a minimizer if it has the smallest value in any of the windows of length w which cover the k-mer. Following the argument of Edgar [26], we can define the compression factor c(w, k) to be the number of k-mers in a window divided by the number of minimizers. Hence, \(c(w,k)\ge 1\) and larger c(w, k) indicates a smaller set of representative k-mers. For (w, k)-minimizers, c(w, k) can be estimated as follows: Consider a pair of adjacent windows of length w in a random sequence. Sliding one position to the right, two k-mers are discarded from the first window (the k-mer and its reverse complement) and two are added to the second. The minimizer in the second window is different from the one in the first window if one of the four affected minimizers has the smallest value over both windows; otherwise, the minimizer of both windows is found in their intersection and does not change. There are \(2\cdot (w-k+2)\) k-mers in the two windows combined, and the probability that a given one of these has the smallest value is \(p = 1/(2\cdot (w-k+2))\). Thus, the probability that a new minimizer is introduced by sliding the window one position is \(4\cdot p = 2/(w -k + 2)\) and hence

$$c(w,k)\approx \frac{w-k+2}{2}$$

For (23, 19)-minimizers, we would expect a compression by a factor of 3. For (32, 18)-minimizers, we would expect a compression factor of 8. Using a larger w is beneficial for saving space. On the other hand, the threshold for determining a match becomes smaller (roughly by the compression factor, for details see [12]). We want to avoid having a threshold of only 1 or 2, since a few false-positive k-mer matches would then result in a false-positive answer.
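As a quick check of these numbers (nothing more than the approximation derived above):

```python
def compression_factor(w, k):
    """Approximate compression factor c(w, k) for (w, k)-minimizers."""
    return (w - k + 2) / 2

print(compression_factor(23, 19))  # 3.0
print(compression_factor(32, 18))  # 8.0
```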

Impact of the HIBF’s false-positive rate

Recall that we ensure that the overall false-positive rate for querying an HIBF does not exceed a certain threshold. Note that, for a given false-positive rate \(p_{\max }\), the size in bits m of an individual Bloom filter is proportional to the number of k-mers n it is designed to store. If h is the number of hash functions, then it holds:

$$m \approx -\frac{h \cdot n}{\ln \left( 1-p_{\max }^{\frac{1}{h}}\right) }$$

Since we aim at making each IBF in the HIBF as small as possible, we will have to accommodate a relatively high false-positive rate.

We use the counts of the (w, k)-minimizers and the probabilistic threshold derived in [12] to decide whether a query is in a bin. However, having a relatively high false-positive rate \(p_{\max }\) will affect this thresholding. If a query has x minimizers, we can compute the probability that a of them return a false-positive answer in an IBF with false-positive rate \(p=p_{\max }\) as

$$b_p(x,a)=\binom{x}{a} \cdot p^a \cdot (1-p)^{x-a}$$

This can be quite significant. For example, for queries of length 250, we have the distribution of minimizers (computed for 1 million random reads) depicted in columns 1 and 2 of Table 2.

Table 2 Exemplary threshold distribution. The values are for (38, 20)-minimizers, 2 errors, and 1 million reads of length 250. Shown are the distribution \(\#x\) (\(\%x\)) of the number (percentage) of minimizers, ranging from \(x=14\) to \(x=35\), and the threshold t(x) using the probabilistic model from [12]. The threshold t(x) incorporates the correction term \(c_p\). On the right, the probability \(b_p(x,a)\) of having \(a\) false-positive answers from the IBF with \(p=0.05\) is shown

The table shows that for a false-positive rate \(p_{\max }\) of 0.05, we encounter reads that have, e.g., 20 (38, 20)-minimizers. Those reads have a chance of almost 40% of having one false-positive count and a chance of almost 19% of having two. This has a small effect on completely random hits. However, hits that match the query with more than the allowed errors could reach the threshold due to one or two false-positive minimizers.

To counter this, we introduce a correction term for the thresholds. We add a constant c to the threshold t(x), which is determined by increasing c until the probability \(b_p(x,c)\) drops below a threshold \(p_{\mathrm{corr}}\). For example, for \(p=0.05\) and \(p_{\mathrm{corr}}=0.15\), we have a correction of \(+1\) for \(14\le x\le 16\) and a correction of \(+2\) for \(17\le x\le 33\). For \(p=0.02\), we have a correction of \(+1\) for all x. The value for \(p_{\mathrm{corr}}\) was experimentally determined using the benchmark from [12] such that the benchmark yielded no false positives and no false negatives. It is set to \(p_{\mathrm{corr}}=0.15\). Those corrections are precomputed and incorporated into the thresholding.
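One way to read and implement this rule, following the notation list above (increase c until \(b_p(x, a+c) < p_{\mathrm{corr}}\), here with \(a=1\)), is sketched below; under this reading it reproduces the \(+1\) and \(+2\) corrections quoted for \(p=0.05\) and \(p_{\mathrm{corr}}=0.15\).

```python
from math import comb

def b_p(x, a, p):
    """Probability of exactly a false-positive answers among x minimizers."""
    return comb(x, a) * p ** a * (1 - p) ** (x - a)

def correction(x, p, p_corr=0.15):
    """Increase c until b_p(x, c + 1) drops below p_corr."""
    c = 0
    while b_p(x, c + 1, p) >= p_corr:
        c += 1
    return c

print(correction(16, 0.05))  # 1  (x in 14..16 -> +1)
print(correction(20, 0.05))  # 2  (x in 17..33 -> +2)
```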
