
Abstract

Introduction:

Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled k-mers, from two closely related sequences. The analysis extracted a point mutation parameter θ quantifying the evolutionary distance between the two sequences.

Methods:

We extend the results of Blanca et al. for complete sketches to parametrized syncmer sketches with downsampling. A syncmer sketch can sample k-mers much more sparsely than a complete sketch. Consider the following simple mutation model disallowing insertions or deletions. Consider a reference sequence A (e.g., a subsequence from a reference genome), and mutate each nucleotide in it independently with probability θ to produce a mutated sequence B (corresponding to, e.g., a set of reads or draft assembly of a related genome). Then, syncmer counts alone yield an approximate Gaussian distribution for estimating θ. The assumption disallowing insertions and deletions motivates a check on the lengths of A and B. The syncmer count from B yields an approximate Gaussian distribution for its length, and a p-value can test the length of B against the length of A using syncmer counts alone.

Results:

The Gaussian distributions permit syncmer counts alone to estimate θ and mutated sequence length with a known sampling error. Under some circumstances, the results provide the sampling error for the Mash containment index when applied to syncmer counts.

Conclusions:

The approximate Gaussian distributions provide hypothesis tests and confidence intervals for phylogenetic distance and sequence length. Our methods are likely to generalize to sketches other than syncmers and may be useful in assembling reads and related applications.

1. INTRODUCTION

Next-generation sequencing has expanded nucleic acid databases so rapidly that some sequence comparisons now occur on the petabase scale (Edgar et al., 2022; Schmidt and Hildebrandt, 2017). At this scale, a sequence is often summarized by a sketch, a set of short oligonucleotides called submers, judiciously selected from the sequence (Edgar, 2021). Sketches can bypass slow computations like assembly and alignment with alignment-free methods. For example, they can compare sequences by exact matching of submers, possibly with hashes or probabilistic methods like Bloom filters.

Call an oligonucleotide of length k a k-mer. The set of all k-mers in a sequence can be viewed as a complete sketch. In contrast, given a hash function from k-mers to numbers, the MinHash sketch uses only the k-mers producing the s smallest hashes to summarize a sequence. Several central limit theorems (CLTs) relevant to the complete and MinHash sketches (Blanca et al., 2022) are available to sharpen the many applications of k-mer-based methods, including database search (Harris and Medvedev, 2020; Solomon and Kingsford, 2016), metagenomic sequence comparison (Wood and Salzberg, 2014), and alignment-free sequence comparison (Ondov et al., 2016; Sarmashghi et al., 2019; Song et al., 2014). For fixed size s, however, the s submers in a MinHash sketch become less dense within the sequence as the sequence length grows.

Motivated by applications better served by sketches that approximate a fixed density throughout a sequence (see, e.g., Manber, 1994), investigators have developed several other selection rules for submers (Shaw and Yu, 2022). On one hand, some sketches (universal hitting sets, polar sets, etc. [Orenstein et al., 2017; Zheng et al., 2021]) do not require explicit probability models, the foundation of our methods. On the other hand, probability models have usefully analyzed sketches based on submers like minimizers (Roberts et al., 2004; Schleimer et al., 2003), minimally overlapping words (Frith et al., 2021), and syncmers (Edgar, 2021). Minimizers are among the oldest submers (Roberts et al., 2004; Schleimer et al., 2003), and they and related techniques have many mature applications such as read mapping (Li, 2018), taxonomy (Wood et al., 2019), and sequence assembly, among others (Roberts et al., 2004; Sommer et al., 2007; Ye et al., 2012). Other submers may eventually prove superior to minimizers in such applications, so the present article presents CLTs pertinent to a sketch of current interest, based on parametrized syncmers (Dutta et al., 2022). Like the complete and MinHash sketches (Blanca et al., 2022), syncmer sketches can provide an estimate of average nucleotide divergence (Ondov et al., 2016) to which we add sampling errors, enriching applications to phylogeny reconstruction (Morgenstern et al., 2015).

Originally, Edgar (2021) emphasized two types of syncmers, closed and open. Dutta et al. (2022) generalized Edgar’s ideas elegantly with parametrized syncmers. Both articles note that a simple technique, random downsampling, can reduce syncmer density as desired. The present article embraces the full generality of parametrized syncmers under possible downsampling. Henceforth, “syncmers” usually refers to parametrized syncmers.

Blanca et al. (2022) produced some CLTs for complete k-mer sketches, so the present article presents Gaussian approximations relevant to unordered sets of syncmers. Empirical evidence suggests that the approximations are useful, and the SI (Supplementary Information) derives them through “proofs” containing uncontrolled approximations. In a mild abuse of language, therefore, we refer to our Gaussian approximations as CLTs.

Let the simple mutation model of the Abstract generate a mutated sequence B from sequence A, where A is a reference sequence or some subsequence from it. Our main theorem, the conserved syncmer CLT, estimates with sampling error the phylogenetic distance between A and B. We also give a syncmer CLT that estimates with sampling error the length of the mutated sequence. Syncmer counts alone yield a p-value that can flag the presence of statistically significant insertions or deletions for further investigation. Our code implements the conserved syncmer and syncmer CLTs.

As motivation, our conserved syncmer CLT has at least one promising future application. Consider the problem of mapping a query sequence onto a reference genome. The reference genome should be chosen from among alternatives by minimizing its phylogenetic distance to the query sequence. If the query sequence is represented by a set of unassembled reads without sequencing errors (under present technology, an unrealistic assumption) and without gaps in coverage, it yields a sketch consisting of an unordered set of syncmers. Our conserved syncmer CLT can estimate phylogenetic distance by comparing the unordered sketches from the query sequence and the genomes. The syncmer CLT can also estimate the length of the query sequence from its unordered sketch, without requiring assembly. Our CLTs furnish sampling errors for all their estimates. Most applications of the CLTs require a k-mer uniqueness condition (stated formally in Section 2.1): informally, within each pairwise comparison of query and reference sequences, identical k-mer sequences usually imply k-mer homology.

Sequencing errors obstruct mapping of unassembled reads as a present application because the reads require arduous postprocessing and comparison to remove their errors. A syncmer sketch is a subset of the complete sketch, however, so its selection may reduce the postprocessing required to remove relevant errors from unassembled reads. Regardless, the future application of our results to unassembled sequences does not exclude other more accessible applications. For example, the conserved syncmer and syncmer CLTs still apply if the query sequence is represented by a (relatively) error-free draft genome.

For practicality, the main text focuses on the concepts required to use our code knowledgeably. In overview, it makes some educated guesses about Gaussian approximations relevant to syncmers and then verifies the accuracy of the guesses empirically. Although the statements and their verifications can be understood to the level of detail given, the SI indicates the intuition behind the approximations and contains the details of computing the Gaussian parameters. Section 2 states our two CLTs informally using aggregate parameters such as means and variances. It summarizes the mathematical results used in the code, deferring detailed technical derivations and lengthy formulas to the Supplementary Data S1. Our GitHub site http://tinyurl.com/syncmer-clt also explains the detailed running of the code. Section 3 assesses the accuracy of the CLTs empirically with simulations. It also examines the practical performance of the CLTs in determining the lengths of 25 mitochondrial sequences and their average nucleotide identity (ANI) when compared with a reference mitochondrial sequence for Homo sapiens neanderthalensis. The ANIs were computed from syncmer counts, so all mitochondrial genomes (including the reference genome) could have been represented by unassembled reads.

The SI contains some ancillary results of independent interest; for example, it gives the distribution of syncmer overlaps. Its main purpose, however, is to demonstrate our CLTs under the assumption that every k- and s-mer in the relevant sequences is unique. Unlike Blanca et al. (2022), we also offer several heuristics to control the practical errors in our CLTs caused by k- and s-mer replicates. When optimizing syncmer parameters for specific purposes, the heuristics can speed empirical experimentation with k- and s-mer lengths, so Sections 3 and 4 state and examine the heuristics.

2. METHODS

String notations: Let “ $: =$ ” denote a definition. Anticipating the 0-offsets used throughout this article and in our computer programs, define the set notations $[i : j) : = [i : j - 1] : = {i, i + 1, ..., j - 1}$ for any integers $i < j$ . Let Σ be a finite alphabet of letters, in applications usually the nucleotide alphabet Σ = {a,c,g,t}. Consider strings $A, B \in Σ^{L + k - 1}$ , and let $A [i]$ denote the ith letter in $A$ ; and $A [i : i + k) : = (A [j] : j \in [i : i + k))$ , the ith k-mer in $A$ . The set $[i : i + k)$ of indices in the k-mer is called the ith k-span (in $A$ ). When we need to focus on the letters $A [i : i + k)$ without regard to the underlying k-span $[i : i + k)$ , we refer to the letters as the k-mer sequence. Following custom, however, we often conflate a k-mer and its sequence without comment. The k-mers corresponding to the ith k-span of A and the jth k-span of B match if the corresponding sequences are equal: $A [i : i + k) = B [j : j + k)$ , where possibly A = B. If a k-mer matches no other k-mer (context determines the relevant set of strings here, e.g., A and B), the k-mer is unique.
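The string notations above can be made concrete with a minimal Python sketch. The function names `kmers` and `unique_kmers` are hypothetical and are not taken from our code; the sketch only illustrates 0-offset k-spans and k-mer uniqueness within a single string.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return the k-mer sequences seq[i : i + k) for i in [0 : L), L = len(seq) - k + 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unique_kmers(seq: str, k: int) -> set[str]:
    """Return the k-mer sequences that match no other k-mer in seq."""
    counts = Counter(kmers(seq, k))
    return {x for x, multiplicity in counts.items() if multiplicity == 1}
```

For a string of length L + k − 1, `kmers` returns exactly L entries, in agreement with the convention that L denotes the k-mer count.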

2.1. The simple mutation model—prior work

To formalize Section 1’s description, consider two strings of equal length, one a reference sequence $A \in Σ^{L + k - 1}$ , the other a related mutated sequence $B \in Σ^{L + k - 1}$ . Note that to promote brevity throughout the article, L is not the length of A, but rather its k-mer count. Sometimes, B plays the role of a query sequence whose evolutionary distance from A we wish to quantify. In the simple mutation model (Blanca et al., 2022), a mutation process acts on A to produce B by altering each letter in A independently with mutation probability $0 < θ < 1$ per letter. Thus, the complementary conservation probability per letter $\bar{θ} : = 1 - θ$ .
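As an illustration of the simple mutation model, the following Python sketch mutates each letter independently with probability θ. It is not the program simulate_nucleotide_errors.py used later in this article. The name `mutate` is hypothetical, and the choice of a uniformly random replacement among the other letters is an assumption that the model itself does not specify.

```python
import random


def mutate(a: str, theta: float, rng: random.Random, alphabet: str = "acgt") -> str:
    """Alter each letter of a independently with probability theta (no indels)."""
    out = []
    for letter in a:
        if rng.random() < theta:
            # Assumption: a mutated letter is drawn uniformly from the other letters.
            out.append(rng.choice([c for c in alphabet if c != letter]))
        else:
            out.append(letter)
    return "".join(out)
```

Because the model disallows insertions and deletions, the output always has the same length as the input, so A and B contain the same number L of k-mers.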

Applications of the simple mutation model usually require some version of the following k-mer uniqueness condition, mentioned in Section 1. For every k-mer sequence x, let the multiplicity $m_{A} (x)$ count the k-spans in sequence A with k-mer sequence x; $m_{B} (x)$ , in sequence B (imagined after assembly of unassembled reads, if necessary). Formally, in the present context, the k-mer uniqueness condition states: (1) $m_{A} (x) = m_{B} (x) = 1$ iff (if and only if) x is the sequence of a k-mer in A that passed to B unmutated; (2) $m_{A} (x) = 1$ and $m_{B} (x) = 0$ iff the k-mer sequence x in A was mutated in passing from A to B; and (3) $m_{A} (x) = 0$ and $m_{B} (x) = 1$ iff the k-mer sequence x in B resulted from a mutation in passing from A to B. Informally, each k-mer in A has a unique sequence and a mutated k-mer in B never matches a k-mer already in A or another mutated k-mer in B. Even if a few k-mers violate the k-mer uniqueness condition, our CLT approximations remain useful.

In some applications, B is homologous to A within a larger genome $A^{*}$ , requiring an extension of the k-mer uniqueness condition to $A^{*}$ outside A. The extended k-mer uniqueness condition includes the k-mer uniqueness condition, along with (4) if $m_{A^{*} \setminus A} (x) > 0$ then $m_{A} (x) = m_{B} (x) = 0$ , that is, k-mers in $A^{*}$ outside of A (i.e., in $A^{*} \setminus A$ ) must not match k-mers in A or B.

2.2. Parametrized syncmers—prior work

2.2.1. k-Mer order

Given $o_{k} : Σ^{k} \to ℝ$ , a one-to-one hash function on k-mers, and two $k$ -mers $x_{1}$ and $x_{2}$ , define $x_{1} \leq x_{2}$ if $o_{k} (x_{1}) \leq o_{k} (x_{2})$ . Examples of k-mer order include lexicographic order and random order. Henceforth, this article uses $o_{k}$ (or $o_{s}$ ) to denote a fixed random hash on k-(or s-)mers.

Some hashing applications to double-stranded DNA consider two k-mer sequences equivalent if they are reverse complements of each other. Canonical hash functions map reverse complementary k-mer sequences $x, \bar{x} \in Σ^{k}$ to the same hash value $o_{k} (x) = o_{k} (\bar{x})$ and may then be appropriate. In fact, if $Π$ is any partition of $Σ^{k}$ , then $o_{k} : Π \to ℝ$ could replace $o_{k} : Σ^{k} \to ℝ$ below without loss of generality. For simplicity, however, the following assumes a one-to-one hash function $o_{k} : Σ^{k} \to ℝ$ .

2.2.2. Syncmer sketches

The submers relevant to the present article are parametrized syncmers (Dutta et al., 2022). Although the following describes them in full generality, a reader who understands only open syncmers (Edgar, 2021) can understand our CLTs. Our notations mostly follow Edgar (2021), who invented syncmers, and Dutta et al. (2022), who introduced parametrized syncmers.

To describe the selection rule for parametrized syncmers (Dutta et al., 2022), fix a k-mer size k >1 and consider a random one-to-one hash function $o_{s} : Σ^{s} \to ℝ$ on s-mers, where s < k. The hash function $o_{s}$ orders s-mers, so each k-mer contains at least one minimum s-mer. We call the minimum s-mers the s-minimizers of the k-mer. Define $u : = k - s$ , so every k-mer contains $u + 1$ s-mers, each at a different (0-)offset in $[0 : u]$ . A syncmer selection rule requires specification of a distinguished subset within the $u + 1$ offsets $[0 : u]$ . If the k-mer has an s-minimizer at a distinguished offset, then the k-mer is a syncmer. The next paragraph gives formal definitions.

Formally, for a k-mer with span $[i : i + k)$ , every s-mer within it starts at one of the offsets $j \in [0 : u]$ , that is, every s-mer within the ith k-mer has a span $[i + j : i + j + s)$ ( $j \in [0 : u]$ ). Fix a subset $Ω : = \{ω_{i} : i = 1, 2, ..., n\} \subseteq [0 : u]$ of n distinguished offsets. A syncmer selection rule f with parameters $(k, s, o_{s}, Ω)$ selects a k-mer if the minimum s-mer within the k-mer starts at some distinguished offset $ω \in Ω$ . More formally (Dutta et al., 2022), $f (A) = S_{k, s, o_{s}, Ω} (A) = \{j \in [0 : L) : \underset{ω \in [0 : u]}{\arg \min} A [j + ω : j + ω + s) \in Ω\}$ (1) where argmin uses the s-mer order derived from the hash function $o_{s}$ .
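The selection rule in Eq (1) can be sketched in a few lines of Python. The sketch is illustrative only and is not our released code: the names `smer_hash` and `syncmer_positions` are hypothetical, and a seeded 64-bit BLAKE2b digest stands in for the random hash $o_{s}$ (an assumption; such a hash is one-to-one only up to negligible collisions). Ties are broken in favor of the leftmost s-minimizer.

```python
import hashlib


def smer_hash(x: str, seed: int = 0) -> int:
    """Stand-in for the random hash o_s: a seeded 64-bit BLAKE2b digest."""
    digest = hashlib.blake2b(
        x.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def syncmer_positions(seq: str, k: int, s: int, omega: set[int], seed: int = 0) -> list[int]:
    """Return the 0-offsets j whose k-mer has its leftmost s-minimizer at an offset in omega."""
    u = k - s
    selected = []
    for j in range(len(seq) - k + 1):
        hashes = [smer_hash(seq[j + w:j + w + s], seed) for w in range(u + 1)]
        # list.index returns the first occurrence, i.e., the leftmost minimum.
        argmin = hashes.index(min(hashes))
        if argmin in omega:
            selected.append(j)
    return selected
```

With `omega = {0, u}` the function selects closed syncmers, and with `omega = {t}` it selects open syncmers at offset t. Because every k-mer has exactly one leftmost s-minimizer, the open syncmers for t = 0, ..., u partition the k-mers of the sequence.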

Some specific examples of syncmers follow. Closed syncmers have an s-minimizer at either end, that is, $Ω = \{0, u\}$ ; open syncmers have an s-minimizer at a specified offset $t \in [0 : u]$ , that is, $Ω = \{t\}$ (Edgar, 2021). For brevity, when context specifies $o_{s}$ , we call the two special cases $(k, s)$ -closed syncmers and $(k, s, t)$ -open syncmers. Given the floor function $⌊ x ⌋ : = \max \{i \in ℤ : i \leq x\}$ , a mid-open syncmer is a $(k, s, ⌊ u / 2 ⌋)$ -open syncmer. Mid-open syncmers are optimal for some purposes when matching or aligning sequences (Shaw and Yu, 2022).

An s-minimizer tie may occur if two different minimum s-mers within a k-mer match. By convention, s-minimizer ties are typically broken in favor of the leftmost s-minimizer, but regardless, s-minimizer ties may be negligible if s is large enough. For simplicity, therefore, the theory below assumes that the s-minimizer within a k-mer is unique. Sections 3 and 4 revisit the (mild) practical discrepancies caused by s-minimizer ties.

2.2.3. Downsampled syncmers

Downsampling can reduce the size of a syncmer sketch by discarding syncmers randomly but reproducibly with fixed probability $0 \leq ε < 1$ . The next paragraph describes downsampling formally, but it is not essential to understanding our CLTs.

Formally, let $U_{k} o_{k} : Σ^{k} \to [0, 1]$ be the functional composition of a random hash function $o_{k} : Σ^{k} \to ℝ$ with a function $U_{k} : ℝ \to [0, 1]$ , judiciously chosen so $U_{k} o_{k} (x)$ is (approximately) uniformly distributed on the real interval [0,1] (e.g., see the section “Downsampled syncmers” in Edgar [2021]). Then, $U_{k} o_{k}$ can be used to reject $k$ -mers x randomly and reproducibly whenever $U_{k} o_{k} (x) > 1 / δ$ , making the rejection probability $ε = 1 - 1 / δ$ , where $δ \geq 1$ is the so-called downsampling rate. Downsampling therefore reduces the density of a submer selection rule by a factor of $δ$ . Typically, downsampling uses a judicious choice of $o_{k}$ to ensure that downsampling is probabilistically independent of the submer selection rule (Edgar, 2021).
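A minimal Python sketch of reproducible downsampling follows. The names `unit_hash` and `downsample` are hypothetical, and a seeded BLAKE2b digest scaled to the unit interval stands in for the composition $U_{k} o_{k}$ (an assumption, not the hash used in our code or in Edgar [2021]). Using a seed different from the one for $o_{s}$ is one simple way to keep downsampling approximately independent of syncmer selection.

```python
import hashlib


def unit_hash(x: str, seed: int = 1) -> float:
    """Map a k-mer sequence to the unit interval: a stand-in for U_k composed with o_k."""
    digest = hashlib.blake2b(
        x.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def downsample(kmer_seqs: list[str], delta: float, seed: int = 1) -> list[str]:
    """Reject every k-mer x with unit_hash(x) > 1 / delta; keep the rest."""
    threshold = 1.0 / delta
    return [x for x in kmer_seqs if unit_hash(x, seed) <= threshold]
```

The rejection depends only on the k-mer sequence, so matching k-mers in A and B are always kept or rejected together, which is what makes the downsampled sketches comparable.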

2.3. The Gaussian approximation for the k-mer count L

Some preliminary terminology relating sets and multisets is useful. If $S$ is any sequence, it may be decomposed into a multiset of k-mers, with support $Supp (S) : = {x : m_{S} (x) > 0}$ being the set underlying the multiset. The standard multiset cardinality $| S | : = \sum_{x \in Supp (S)} m_{S} (x)$ counts the elements with multiplicities, so let ${| S |}_{1} : = | Supp (S) |$ denote the cardinality without multiplicities.

Let $S (A)$ denote the multiset of downsampled $(k, s, o_{s}, Ω)$ -syncmers in A, that is, $S (A)$ is an unordered list of syncmer sequences x, where $m_{A} (x)$ counts the copies of x in $S (A)$ (see Section 2.1); define $S (B)$ and $m_{B} (x)$ similarly. Often in applications, the multiset $S (B)$ is unknown, but the underlying set $Supp (S (B))$ (without multiplicities) is known (Ondov et al., 2019). The k-mer uniqueness condition permits us to ignore the distinction between $S (B)$ and $Supp (S (B))$ .

Now, consider a sequence $B \in Σ^{L + k - 1}$ , so B contains exactly L k-mers. Under random hashing, syncmers are known to provide an estimate of L, as follows. For brevity, let $u = k - s$ . Let $| Ω |$ count the elements in $Ω$ , and assume s is chosen large enough that we may neglect s-minimizer ties. As an example of the foregoing, then, mid-open syncmers have their unique s-minimizer at the (0-)offset $⌊ u / 2 ⌋ \in [0 : u]$ , with $| Ω | = 1$ .

Let the subscript $Y$ connote a variable related to syncmers and $μ_{Y}$ connote syncmer density. In particular, $μ_{Y} = | Ω | / (u + 1)$ : if a random k-mer is a syncmer, the offset of an s-minimizer was chosen uniformly at random from $[0 : u]$ and fell in $Ω$ (Shaw and Yu, 2022). Under the k-mer uniqueness condition, the point estimate $\hat{L} = {| S (B) |}_{1} / [μ_{Y} (1 - ε)]$ (2) therefore approximates the k-mer count L. Our CLTs sharpen the point estimate by quantifying the sampling error. The following uses a typical notation, where the random variate $Z$ has a standard normal distribution so $ℙ (Z \leq z) = Φ (z)$ , where $Φ (z) = {(2 π)}^{- 1 / 2} \int_{- \infty}^{z} e^{- y^{2} / 2} d y$ , and $Φ (z_{α}) = 1 - α$ .
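The density $μ_{Y}$ and the point estimate in Eq (2) amount to simple arithmetic, sketched below in Python. The function names are hypothetical and are not taken from our code.

```python
def syncmer_density(k: int, s: int, n_offsets: int) -> float:
    """Return mu_Y = |Omega| / (u + 1), with u = k - s, neglecting s-minimizer ties."""
    return n_offsets / (k - s + 1)


def estimate_kmer_count(n_distinct_syncmers: int, k: int, s: int,
                        n_offsets: int, eps: float = 0.0) -> float:
    """Return the point estimate of L in Eq (2) from the distinct downsampled syncmer count."""
    return n_distinct_syncmers / (syncmer_density(k, s, n_offsets) * (1.0 - eps))
```

For example, mid-open syncmers with (k, s) = (15, 5) have density 1/11, so 100 distinct syncmers without downsampling give an estimate of about 1100 k-mers.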

In the following, $σ_{Y, ε}^{2}$ and $γ_{Y, ε}$ are constants related to downsampled syncmers (hence, $Y$ and $ε$ appear as subscripts). As in the Wilson score test (Blanca et al., 2022; Wilson, 1927), define the random variate $W (L) : = \frac{{| S (B) |}_{1} - L [μ_{Y} (1 - ε)]}{\sqrt{L σ_{Y, ε}^{2} + γ_{Y, ε}}}$ (3) where the SI shows how to compute the constants $σ_{Y, ε}^{2}$ and $γ_{Y, ε}$ by neglecting s-minimizer ties and then analyzing syncmer overlaps. The “Hoeffding–Robbins CLT” in the SI specializes the general CLT in Hoeffding and Robbins (1994) to the context of k-mers. It shows that as the k-mer count L increases to the limit infinity, the distribution of $W (L)$ approaches a standard normal distribution. Our numerical methods guarantee a solution for the approximate $1 - α$ confidence interval (CI) for L from the equation $ℙ (- z_{α / 2} \leq W (L) \leq z_{α / 2}) = 1 - α$ (4) under the approximation that $W (L)$ has the limiting standard normal distribution. The approximate $1 - α$ CI for L also provides hypothesis testing against specific values of L.
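To illustrate how Eqs (3) and (4) yield a CI, the Python sketch below inverts $W (L)^{2} \leq z_{α / 2}^{2}$, which is a quadratic inequality in L. This is a sketch under stated assumptions and not necessarily the numerical method in our code. The constants `sigma2` and `gamma` must come from the formulas in the SI, which are not reproduced here; the function names are hypothetical, `density` denotes $μ_{Y} (1 - ε)$, and the two-sided p-value is the one matching the two-sided CI.

```python
import math
from statistics import NormalDist


def w_stat(L: float, n_syncmers: int, density: float, sigma2: float, gamma: float) -> float:
    """Evaluate W(L) in Eq (3); density is mu_Y * (1 - eps)."""
    return (n_syncmers - L * density) / math.sqrt(L * sigma2 + gamma)


def length_ci(n_syncmers: int, density: float, sigma2: float, gamma: float,
              alpha: float = 0.05) -> tuple[float, float]:
    """Solve (n - L d)^2 = z^2 (L sigma2 + gamma) for the two CI endpoints."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    a = density ** 2
    b = 2.0 * n_syncmers * density + z * z * sigma2
    c = n_syncmers ** 2 - z * z * gamma
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("no real solution for these constants")
    root = math.sqrt(disc)
    return (b - root) / (2.0 * a), (b + root) / (2.0 * a)


def length_p_value(L0: float, n_syncmers: int, density: float,
                   sigma2: float, gamma: float) -> float:
    """Two-sided p-value for the hypothesis L = L0 under the Gaussian approximation."""
    w = w_stat(L0, n_syncmers, density, sigma2, gamma)
    return 2.0 * (1.0 - NormalDist().cdf(abs(w)))
```

A call such as `length_ci(100, 0.1, sigma2, gamma)` then returns an interval around the point estimate 100 / 0.1 = 1000, with a width governed by the SI constants.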

2.4. The Gaussian approximation for the mutation probability θ

Consider a reference sequence $A \in Σ^{L + k - 1}$ and a mutated sequence $B \in Σ^{L + k - 1}$ , both containing L k-mers, under the simple mutation model. Let us estimate the mutation probability $θ$ per letter, or equivalently, the conservation probability $\bar{θ} : = 1 - θ$ .

In the following, $σ_{Y, Θ, ε}^{2} (θ)$ and $γ_{Y, Θ, ε} (θ)$ are the values of functions related to downsampled syncmers under mutation (hence, $Y$ , $Θ$ , and $ε$ appear as subscripts). In contrast to $σ_{Y, ε}^{2}$ and $γ_{Y, ε}$ in Eqs (3) and (4), $σ_{Y, Θ, ε}^{2} (θ)$ and $γ_{Y, Θ, ε} (θ)$ depend on $θ$ . Let $S (A) \cap S (B)$ denote the (unordered) set of downsampled syncmers common to $A$ and $B$ . Let $W (θ, L) : = \frac{{| S (A) \cap S (B) |}_{1} - {\bar{θ}}^{k} {| S (B) |}_{1}}{\sqrt{L σ_{Y, Θ, ε}^{2} (θ) + γ_{Y, Θ, ε} (θ)}}$ (5) where the SI shows how to compute $σ_{Y, Θ, ε}^{2} (θ)$ and $γ_{Y, Θ, ε} (θ)$ by neglecting s-minimizer ties and then analyzing syncmer overlaps in the simple mutation model. Even if A is a substring in some larger sequence $A^{*}$ , $S (A) \cap S (B)$ is observable because the extended k-mer uniqueness condition implies that the syncmer set $S (A) \cap S (B) = S (A^{*}) \cap S (B)$ . Under the k-mer uniqueness condition, the SI shows that according to the Hoeffding–Robbins CLT, as the k-mer count L increases to the limit infinity, the distribution of $W (θ, L)$ approaches the standard normal distribution. Thus, the equation $ℙ (- z_{α / 2} \leq W (θ, L) \leq z_{α / 2}) = 1 - α$ (6) holds approximately, thereby providing both an approximate $1 - α$ CI for $θ$ and hypothesis tests for specific values $θ = θ_{0}$ . In particular, the Wilson score test still applies despite the additional dependence that $σ_{Y, Θ, ε}^{2} (θ)$ and $γ_{Y, Θ, ε} (θ)$ have on $θ$ . In the random variate $W (θ, L)$ , even if the reference k-mer count L is unknown, Eq (2) provides the point estimate $\hat{L}$ . Under the k-mer uniqueness condition, the estimate $\hat{L}$ satisfies $\hat{L} / L \to 1$ (convergence in probability), showing that $\hat{L}$ can substitute for L without affecting the CLT.

Given the exact L or its estimate $\hat{L}$ , even without convexity or concavity, exhaustive numerical search by mesh on the finite domain $θ \in [0, 1]$ finds bounds on $θ$ to biologically useful accuracies. As a numerical procedure, the mesh search is not infallible, but in all cases examined so far, $W (θ, L)$ appears to be an increasing function of $0 < θ < 1$ , and all computations have run without apparent error.
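The mesh search can be sketched as follows in Python. The sketch is not our released code: the function names are hypothetical, and the callables `sigma2_fn` and `gamma_fn` are placeholders for the functions $σ_{Y, Θ, ε}^{2} (θ)$ and $γ_{Y, Θ, ε} (θ)$ whose formulas appear only in the SI. The search simply keeps every mesh point at which $| W (θ, L) | \leq z_{α / 2}$ and reports the extremes.

```python
import math
from statistics import NormalDist
from typing import Callable


def w_theta(theta: float, n_shared: int, n_b: int, L: float, k: int,
            sigma2_fn: Callable[[float], float],
            gamma_fn: Callable[[float], float]) -> float:
    """Evaluate W(theta, L) in Eq (5) from the two observable syncmer counts."""
    numerator = n_shared - (1.0 - theta) ** k * n_b
    return numerator / math.sqrt(L * sigma2_fn(theta) + gamma_fn(theta))


def theta_ci(n_shared: int, n_b: int, L: float, k: int,
             sigma2_fn: Callable[[float], float],
             gamma_fn: Callable[[float], float],
             alpha: float = 0.05, mesh: int = 10_000) -> tuple[float, float]:
    """Scan a uniform mesh on (0, 1) and return the extreme theta with |W| <= z."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    inside = [
        i / mesh for i in range(1, mesh)
        if abs(w_theta(i / mesh, n_shared, n_b, L, k, sigma2_fn, gamma_fn)) <= z
    ]
    if not inside:
        raise ValueError("no mesh point satisfies |W| <= z; refine the mesh")
    return min(inside), max(inside)
```

The resolution of the bounds is 1 / `mesh`, so the mesh size can be chosen to match the accuracy that an application requires. If $W (θ, L)$ is monotone in θ, as observed so far, the retained mesh points form a single interval.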

Again, the equation $W (\hat{θ}, L) = 0$ yields ${(1 - \hat{θ})}^{k} = {| S (A) \cap S (B) |}_{1} / {| S (B) |}_{1}$ (cf. Eq (3) in Morgenstern et al. [2014]) and an estimator $\hat{θ} = 1 - {(\frac{{| S (A) \cap S (B) |}_{1}}{{| S (B) |}_{1}})}^{1 / k}$ (7), the complement of the Mash containment index $c_{k}$ in Ondov et al. (2019), but with syncmers. When disallowing indel mutations, therefore, the Mash containment index quantifies point mutations, and Eqs (5) and (6) estimate the corresponding sampling error for syncmers.
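The estimator in Eq (7) needs only two unordered sets of syncmer sequences, as the following minimal Python sketch shows. The function name `theta_hat` is hypothetical; the sets are assumed to hold distinct downsampled syncmer sequences, that is, the supports of $S (A)$ and $S (B)$.

```python
def theta_hat(sketch_a: set[str], sketch_b: set[str], k: int) -> float:
    """Return the point estimate of theta in Eq (7) from two syncmer sketches."""
    if not sketch_b:
        raise ValueError("sketch_b must contain at least one syncmer")
    # Fraction of the distinct syncmers of B that also occur in A.
    containment = len(sketch_a & sketch_b) / len(sketch_b)
    return 1.0 - containment ** (1.0 / k)
```

Identical sketches give an estimate of 0, and disjoint sketches give an estimate of 1, the two extremes of the mutation probability.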

2.5. Simulation of CIs

Eqs (3) and (5) can contribute steps to a bioinformatics pipeline. Our code implements Eqs (3) and (5) verbatim, inputting the k-mer count L and not the sequence length L + k − 1. In anticipation of actual usage, unless noted otherwise throughout Section 3, the sequence length L + k − 1 is substituted for the k-mer count L in both code input and interpretation of code output. Section 3 therefore implicitly examines when the difference k − 1 is negligible in practice. To anticipate, if k is a few percent of L + k − 1, our CLTs can conflate the sequence length and k-mer count without harm.

In Eq (3), the downsampled syncmer count ${| S (B) |}_{1}$ is an observable yielding $1 - α$ CIs for L in Eq (4). To compare theoretical L-CIs (i.e., CIs for L) with simulations at lengths L + k − 1 = 10ⁿ (n = 2, 3, 4, 5), the package noverlap (Frith et al., 2021) (version 2.4.1) chose letters independently from a uniform distribution on the nucleotide alphabet {a,c,g,t} to generate 1000 random sequences of the desired length. Within the random sequences, noverlap then used murmur64 as a randomized hash $o_{s}$ to identify and count mid-open syncmers. The program noverlap breaks s-minimizer ties by insisting that the leftmost s-minimizer is in the mid-position. The syncmers were not downsampled (i.e., $ε = 0$ ). After syncmer selection, the 1000 sampled sequences and Eq (3) then provided an estimate of the L-CI from Eq (4).
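The simulation design above can be imitated in miniature with the Python sketch below, which is not the noverlap package. The function names are hypothetical, a seeded BLAKE2b digest replaces murmur64 (an assumption), and the sketch counts syncmer positions rather than distinct syncmer sequences, which agree under the k-mer uniqueness condition. Ties are broken as in noverlap, by requiring the leftmost s-minimizer to sit in the mid-position.

```python
import hashlib
import random


def smer_hash(x: str, seed: int = 0) -> int:
    """Stand-in for murmur64: a seeded 64-bit BLAKE2b digest."""
    digest = hashlib.blake2b(
        x.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def count_mid_open_syncmers(seq: str, k: int, s: int, seed: int = 0) -> int:
    """Count k-mers whose leftmost s-minimizer lies at the mid offset floor(u / 2)."""
    u = k - s
    t = u // 2
    # Hash every s-mer once, then slide a window of u + 1 hashes along the sequence.
    hashes = [smer_hash(seq[i:i + s], seed) for i in range(len(seq) - s + 1)]
    count = 0
    for j in range(len(seq) - k + 1):
        window = hashes[j:j + u + 1]
        if window.index(min(window)) == t:
            count += 1
    return count


def simulate_density(length: int, k: int, s: int, n_seqs: int, rng: random.Random) -> float:
    """Return the pooled fraction of k-mers that are mid-open syncmers in random sequences."""
    total_syncmers = 0
    total_kmers = 0
    for _ in range(n_seqs):
        seq = "".join(rng.choice("acgt") for _ in range(length))
        total_syncmers += count_mid_open_syncmers(seq, k, s, seed=0)
        total_kmers += length - k + 1
    return total_syncmers / total_kmers
```

For (k, s) = (15, 5), the pooled fraction should lie close to the theoretical density 1/(u + 1) = 1/11 when s-minimizer ties are rare.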

Similarly, the observable syncmer counts ${| S (B) |}_{1}$ and ${| S (A) \cap S (B) |}_{1}$ in Eq (5) for $W (θ, L)$ yielded $1 - α$ $θ$ -CIs in Eq (6). To compare $θ$ -CIs to simulation results at lengths L + k − 1 = 10ⁿ (n = 2, 3, 4, 5) and mutation probabilities $θ$ = 0.05, 0.15, 0.25 per letter, the package noverlap (Frith et al., 2021) generated 1000 random sequences of the desired length. For each mutation probability $θ$ , the program simulate_nucleotide_errors.py (Blanca et al., 2022) then introduced independent Bernoulli point mutations into each of the 1000 reference sequences to generate a corresponding mutated sequence. Within the reference and mutated sequences, noverlap then used murmur64 as a randomized hash $o_{s}$ and identified mid-open syncmers. The 1000 sampled pairs of reference and mutated sequences in Eq (5) then provided an estimate of the $θ$ -CI.

2.6. Estimates with sampling errors for mitochondrial lengths and mutation probabilities

Syncmer sketches bypass computationally intensive methods like alignment and assembly while producing many similar biological results. To validate our results, we compared them with the gold standard of assembled sequences, as follows.

For reference and query sequences, GenBank yielded the complete mitochondrial genomes of Homo sapiens neanderthalensis and 24 other taxa. The file “mitochondria.csv” in the GitHub repository gives taxonomic names, common names, and the National Center for Biotechnology Information (NCBI) accession numbers for the taxa and their mitochondrial sequences. Along with great apes, monkeys, and other mammals, the 24 taxa also included the following nonmammalian taxa: penguin (Eudyptes chrysolophus), snake (Trimerodytes annularis), crocodile (Crocodylus porosus), turtle (Graptemys ouachitensis), frog (Bufo gargarizans), fish (Latimeria menadoensis), jellyfish (Pelagia noctiluca), and sea urchin (Echinocrepis rostrata).

We accessed NCBI servers on November 15, 2023, at the (abbreviated) URL http://tinyurl.com/blastn-blast2seq. They estimated the pairwise Basic Local Alignment Search Tool (BLAST) ANI between our reference and query mitochondrial sequences. Our programs also estimated the complement of the Mash containment index in Eq (7) from syncmer counts, as well as estimating 95% $θ$ -CIs for the ANI with Eq (6) for syncmers.

The mitochondrial reference sequence for Homo sapiens neanderthalensis has length 16565 (all sequence lengths in nucleotides). To detect insertions and deletions in the mitochondria of the 24 taxa relative to the assembled reference length 16565, Eqs (3) and (4) provide hypothesis tests against fixed lengths. To verify the results of the hypothesis test, syncmers provide a natural estimator for sequence length from Eq (2) and a 95% CI from Eqs (3) and (4), which we then plotted against the gold standard of the true mitochondrial lengths of the 24 taxa.

3. RESULTS

This section examines only mid-open syncmers, because of their optimality properties (Shaw and Yu, 2022). The CIs above take milliseconds to calculate. Our simulations therefore evaluate only statistical accuracy, not computational speed. The R programming language was used to graph the simulation results.

As described in Section 2.5, throughout this section our code usage substituted the sequence length L + k − 1 for the k-mer count L for both input and output. To anticipate, if k is a few percent of L + k − 1, the sequence length and k-mer count can be conflated in our CLTs without harm.

3.1. Simulated CIs for the k-mer count L

Section 2.5 describes how to sample 95% L-CIs. Figure 1 plots the sample average of the endpoints of the CI for lengths L + k − 1 = 10ⁿ (n = 2, 3, 4, 5). The difference between the endpoints quantifies the precision of the CI.

FIG. 1.

Mid-open syncmers and expected 95% confidence intervals (CIs) for the length L + k − 1. Each of the four simulations shown generated 1000 random sequences of uniform nucleotide composition, with sequence length L + k − 1 indicated in the upper left. The X-axis indicates the s-mer size, with each k-mer size corresponding to a vertical pair of markers over the s-mer size. Over each s-mer size, from left to right the k-mer sizes are: 10 (red circles); 15 (green diamonds); 20 (brown triangles); 25 (blue squares); and 30 (black crosses inside circles). The markers display the sample averages of the endpoints of the 1000 95% CIs, with the Y-axis giving the ratio of the estimated k-mer count directly from the code to the true length. The horizontal dashed line indicates an ideal estimate, with ratio 1.0.

Figure 2 plots accuracy, the probability that the sampled CIs above contain the true length L + k − 1, for L + k − 1 = 10ⁿ (n = 2, 3, 4, 5).

FIG. 2.

Mid-open syncmers and accuracy of 95% CIs for the length L + k − 1. Each of the four simulations shown generated 1000 random sequences of uniform nucleotide composition, with the length L + k − 1 indicated in the upper left. The X-axis indicates the s-mer size, with each k-mer size corresponding to a vertical pair of markers over the s-mer size. Over each s-mer size, from left to right the k-mer sizes are: 10 (red circles); 15 (green diamonds); 20 (brown triangles); 25 (blue squares); and 30 (black crosses inside circles). The markers indicate the accuracy of the corresponding 95% CI. The horizontal dashed line indicates ideal accuracy, 0.95.

In Figure 1, which displays the sample-averaged endpoints of CIs for the length L + k − 1, the number of syncmers sampled is (approximately) proportional to L; and the standard deviation of the number, to $L^{1 / 2}$ , so the CLT should yield an L-CI whose relative width (shown vertically) is proportional to $L^{1 / 2} / L = L^{- 1 / 2}$ . At L + k − 1 = 10², the relative width is somewhat <1.0. As L + k − 1 increases by factors of 10 (L + k − 1 = 10ⁿ for n = 2, 3, 4, 5), the relative width decreases by about a factor of $10^{- 1 / 2}$ , in accord with theoretical expectations.

Ideally, a 95% CI should give a 0.95 accuracy. Figure 2 shows that most L-CIs at L + k − 1 = 10², for example, are inaccurate. According to a classical CLT heuristic, a CLT approximation at p = 0.05 requires about 30 independent samples (Fisher, 1925, p. 80), possibly 20, and at a minimum 10 (Corder and Foreman, 2011). The syncmers number about L/(u + 1), where u = k − s. The classical CLT heuristic directly implies a Submer heuristic about the minimum number of submers that our CLTs require for accuracy (i.e., about 10–30). Thus, sequences of length L + k − 1 = 10² typically appear to produce too few syncmers to support an accurate CLT approximation.

The simulations used a uniform frequency of random letters, so the match probability from Supplementary Section S3 is q = 1/4. Supplementary Section S3 and Supplementary Eq (S17) contain a k-mer uniqueness heuristic indicating that $L 4^{- k} / 2$ must be small. For L + k − 1 = 10⁵ and k = 10 in Figure 2, $10^{5} 4^{- 10} / 2 \approx 0.05$, demonstrating (unsurprisingly) that larger k-mer counts L require longer k-mers to maintain k-mer uniqueness.
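The k-mer uniqueness quantity is easy to tabulate across k. The sketch below assumes the form $L q^{k} / 2$ quoted above with q = 1/4; the function name is hypothetical, and, as in Section 3, the sequence length stands in for the k-mer count L.

```python
def kmer_uniqueness_heuristic(kmer_count: float, k: int, q: float = 0.25) -> float:
    """Return L * q**k / 2, which should be small for k-mers to be (nearly) unique."""
    return kmer_count * q**k / 2


for k in (10, 15, 20, 25, 30):
    value = kmer_uniqueness_heuristic(10**5, k)
    print(k, f"{value:.2e}")
```

At L = 10⁵, moving from k = 10 to k = 15 reduces the quantity by a factor of 4⁵ = 1024.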

Similarly, our calculation of the correlations in the CLTs neglects s-minimizer ties in syncmers. Supplementary Section S3.2 has an s-minimizer heuristic asserting that the quantity $(u + 1) 4^{- s} / 2$ should be small to avoid s-minimizer ties. For example, $(10 - 2 + 1) 4^{- 2} / 2 \approx 0.3$ for (k,s) = (10,2), $(30 - 3 + 1) 4^{- 3} / 2 \approx 0.2$ for (k,s) = (30,3), and so on, are likely too large, and Figure 2 shows that the corresponding L-CIs are inaccurate. Figure 2 also shows that s = 2 is often too small to produce accurate CIs, and moreover, the inaccuracies slowly become more pronounced as k-mer counts L increase and more information is available. Again, unsurprisingly, longer syncmers require longer s-mers to avoid s-minimizer ties.
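The s-minimizer quantity can be evaluated the same way for any candidate (k,s). This is a minimal sketch assuming the form $(u + 1) q^{s} / 2$ quoted above with q = 1/4; the function name is hypothetical.

```python
def s_minimizer_tie_heuristic(k: int, s: int, q: float = 0.25) -> float:
    """Return (u + 1) * q**s / 2 with u = k - s; should be small to avoid s-minimizer ties."""
    u = k - s
    return (u + 1) * q**s / 2


for k, s in ((10, 2), (30, 3), (13, 7)):
    print((k, s), s_minimizer_tie_heuristic(k, s))
```

The first two pairs reproduce the values of about 0.3 and 0.2 in the text, whereas (k,s) = (13,7), the pair used for the mitochondrial results, gives a value near 2 × 10⁻⁴.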

In addition, for k = 10 in Figure 2, the accuracies for 2 ≤ s ≤ 6 display an odd-even s-alternation, that is, the accuracy decreases for the even s succeeding each odd s. The SI shows that in the absence of s-minimizer ties, the distance between consecutive mid-open syncmers exceeds $⌊ u / 2 ⌋$, highlighting the discreteness of the distribution that the CLT approximates. The syncmer CLT may therefore require a continuity correction to account for discrete atoms of probability. Certainly, the CLT approximation should improve if the distance between consecutive syncmers can vary, suggesting that a deterministic distance between submers may be best for many applications (Shaw and Yu, 2022) but (unsurprisingly) not for CLTs. Thus, the CLT approximation may improve if u = k − s increases, increasing the variability of the distance between consecutive syncmers.
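The lower bound on the distance between consecutive mid-open syncmers can be observed directly in a small simulation. The sketch below is not the authors' implementation. It assumes that a mid-open (k,s)-syncmer is a k-mer whose smallest s-mer sits at the 0-based offset ⌊u/2⌋, it uses a BLAKE2b digest as a stand-in for the s-mer order, and it discards any k-mer whose minimum is tied, so that the no-ties condition holds by construction.

```python
import hashlib
import random


def smer_order(smer: str) -> int:
    """Pseudo-random order value for an s-mer (a stand-in for the s-mer hash)."""
    digest = hashlib.blake2b(smer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def mid_open_syncmer_positions(seq: str, k: int, s: int) -> list[int]:
    """Start positions of k-mers whose unique smallest s-mer lies at offset (k - s) // 2."""
    u = k - s
    t = u // 2  # assumed mid-open offset
    orders = [smer_order(seq[i:i + s]) for i in range(len(seq) - s + 1)]
    positions = []
    for i in range(len(seq) - k + 1):
        window = orders[i:i + u + 1]  # the u + 1 s-mers inside the k-mer at i
        smallest = min(window)
        # Require a strict (untied) minimum at the middle offset.
        if window[t] == smallest and window.count(smallest) == 1:
            positions.append(i)
    return positions


random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(5000))
k, s = 13, 7
pos = mid_open_syncmer_positions(seq, k, s)
gaps = [b - a for a, b in zip(pos, pos[1:])]
min_gap = min(gaps)
print(len(pos), min_gap, (k - s) // 2)
```

With untied minima, two syncmers closer than ⌊u/2⌋ + 1 would each contain the other's minimal s-mer, which is impossible, so the smallest observed gap always exceeds ⌊u/2⌋.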

Based on these results, Section 4 contains some qualitative heuristics for selecting L, k, and s empirically to improve the precision and accuracy of L-CIs with mid-open syncmers. The SI also contains heuristics for avoiding s-minimizer ties.

3.2. Simulated CIs for the mutation probability θ per letter

Figure 3 displays the sample-averaged endpoints of the CIs for the mutation probability $θ$ per letter for lengths L + k − 1 = 10 ⁿ (n = 2, 3, 4, 5). The difference between the endpoints quantifies the precision of the CI.

FIG. 3.

Mid-open syncmers and expected 95% CIs for mutation probability θ = 0.15. Each of the four simulations shown generated 1000 random sequences of uniform nucleotide composition, with the length L + k − 1 indicated in the upper left. The X-axis indicates the s-mer size, with each k-mer size corresponding to a vertical pair of markers over the s-mer size. Over each s-mer size, from left to right the k-mer sizes are: 10 (red circles); 15 (green diamonds); 20 (brown triangles); 25 (blue squares); and 30 (black crosses inside circles). The markers indicate the accuracy of the corresponding 95% CI. The horizontal dashed line indicates ideal accuracy, 0.95.

Figure 4 plots accuracy, the probability that the sampled CIs above contain the true θ (ideally, a 95% CI gives 0.95 accuracy).

FIG. 4.

Mid-open syncmers and accuracy of 95% CIs for mutation probability θ = 0.15. Each of the four simulations shown realized 1000 random sequences of uniform nucleotide composition, with length L + k − 1 indicated in the upper left. The X-axis indicates the s-mer size, with each k-mer size corresponding to a vertical pair of markers over the s-mer size. Over each s-mer size, from left to right the k-mer sizes are: 10 (red circles); 15 (green diamonds); 20 (brown triangles); 25 (blue squares); and 30 (black crosses inside circles). The markers indicate the accuracy of the corresponding 95% CI. The horizontal dashed line indicates ideal accuracy, 0.95.
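Accuracy in the sense used for these figures is an empirical coverage probability. A minimal sketch of the calculation follows, assuming the simulated CIs are available as (lower, upper) pairs; the function name and the example intervals are hypothetical.

```python
def ci_accuracy(intervals, true_value):
    """Fraction of confidence intervals (lower, upper) that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


# Four made-up intervals for a true mutation probability of 0.15.
example = [(0.10, 0.20), (0.16, 0.30), (0.12, 0.18), (0.00, 0.14)]
print(ci_accuracy(example, 0.15))
```

For a well-calibrated 95% CI, the value computed over many simulated sequences should be close to 0.95.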

In Figure 3, the mutation probability $θ$ does not influence the depth of sampling as the k-mer count L does, so the CLT should yield a $θ$-CI with width proportional to $1 / L^{1 / 2} = L^{- 1 / 2}$. The $θ$-CIs for L + k − 1 = 10² and L + k − 1 = 10³ show a noticeable bias in the estimates of $θ$, but once the bias diminishes, the width decreases by about a factor of $10^{- 1 / 2}$ from L + k − 1 = 10⁴ to L + k − 1 = 10⁵, in accordance with theoretical expectations. For L + k − 1 = 10⁵ and k = 10 (leftmost, red intervals in the bottom graph), CIs are biased downward, likely displaying violation of the k-mer uniqueness condition in Section 2.1. With $\bar{θ} = 1 - θ = 0.85$, the heuristic in Supplementary Eq (S19) suggests that $(1 - {\bar{θ}}^{k}) L q^{k} \approx (1 - {0.85}^{10}) 10^{5} 4^{- 10} \approx 0.08$ should be small to avoid the k-mer collisions that decrease the apparent mutation rate. Indeed, the quantity 0.08 is large enough to account for the deviations for L + k − 1 = 10⁵ and k = 10 in Figure 3. The accuracy for L + k − 1 = 10⁵ and k = 10 in Figure 4 also reflects the same spurious matches. Here at least, limiting the spurious matches of mutated k-mers in A and B with the lower bound in Supplementary Eq (S19) appears more important than limiting replicates of conserved k-mers with the upper bound in Supplementary Eq (S18).

Figure 4 displays the accuracy of $θ$-CIs. The conserved submer heuristic, a variant of the submer heuristic following Figure 2, readily explains the relative inaccuracies for L + k − 1 = 10² and 10³. For a mutation probability θ, the conserved submer heuristic posits that a CLT requires about 10–30 conserved syncmers, and the total conserved syncmers number only about ${\bar{θ}}^{k} L / (u + 1)$, where u = k − s. Figure 4 also displays other contrasts with previous figures. Relative to the L-estimates in Figure 2, the θ-estimates in Figure 4 display fewer inaccuracies for short s-mers, suggesting that tied s-minimizers have more influence on the accuracies of L-CIs than θ-CIs.
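The conserved submer heuristic can be evaluated numerically from the approximate count ${\bar{θ}}^{k} L / (u + 1)$. The following minimal sketch uses a hypothetical function name and, as in Section 3, lets the sequence length stand in for the k-mer count L.

```python
def expected_conserved_syncmers(kmer_count: float, k: int, s: int, theta: float) -> float:
    """Approximate conserved syncmer count (1 - theta)**k * L / (u + 1), with u = k - s."""
    u = k - s
    return (1 - theta) ** k * kmer_count / (u + 1)


for kmer_count in (10**2, 10**3, 10**4, 10**5):
    for k in (10, 30):
        count = expected_conserved_syncmers(kmer_count, k, s=6, theta=0.15)
        print(kmer_count, k, round(count, 2))
```

With θ = 0.15 and s = 6, a k-mer count of 10³ yields roughly 39 conserved syncmers for k = 10 but fewer than 1 for k = 30, which illustrates how quickly longer k-mers exhaust the conserved count.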

Supplementary Figure S1 for θ = 0.05 and Supplementary Figure S2 for θ = 0.25 display accuracies analogous to Figure 4 for θ = 0.15. They illustrate the effect on accuracy of varying θ.

As previously stated, the conserved submer heuristic requires about 10–30 conserved syncmers. The approximate count is about ${\bar{θ}}^{k} L / (u + 1)$. In Figure 4, Supplementary Figure S1, and Supplementary Figure S2, the count correctly anticipates the general increase in θ-CI accuracies with k-mer count L and the general decrease in θ-CI accuracies as the mutation probability increases through θ = 0.05, 0.15, and 0.25. At L + k − 1 = 10⁵ and k = 10, with θ = 0.15 or 0.25, the anomalous inaccuracy of the θ-CI likely reflects a different problem: violation of the k-mer uniqueness heuristic. In Supplementary Eq (S19), the bound $(1 - {\bar{θ}}^{k}) (L q^{k}) \approx (1 - {0.85}^{10}) (10^{5} 4^{- 10}) \approx 0.08$ does not limit mutated k-mer collisions enough, as is the case at L + k − 1 = 10⁴ and k = 10, with θ = 0.25: $(1 - {\bar{θ}}^{k}) (L q^{k}) \approx (1 - {0.75}^{10}) (10^{4} 4^{- 10}) \approx 0.01$.
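The collision quantity from Supplementary Eq (S19) can be computed for any combination of L, k, and θ. The sketch below assumes the form $(1 - {\bar{θ}}^{k}) L q^{k}$ quoted in the text with q = 1/4; the function name is hypothetical.

```python
def mutated_kmer_collision_heuristic(kmer_count: float, k: int, theta: float,
                                     q: float = 0.25) -> float:
    """Return (1 - (1 - theta)**k) * L * q**k, which should be small to avoid spurious matches."""
    conserved_probability = (1 - theta) ** k
    return (1 - conserved_probability) * kmer_count * q**k


for k in (10, 15, 20):
    value = mutated_kmer_collision_heuristic(10**5, k, theta=0.15)
    print(k, f"{value:.2e}")
```

For L = 10⁵ and θ = 0.15, the quantity is about 0.08 at k = 10 and falls below 10⁻⁴ at k = 15.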

The inaccuracies in θ-estimates in Supplementary Figure S1 and Supplementary Figure S2 are not confined specifically to short s-mers, suggesting that as in the θ-estimates in Figure 4, s-minimizer ties are not influential.

3.3. Estimates with errors for mitochondrial lengths and their mutation probabilities

Excel was used to graph mitochondrial results, all derived from mid-open (k,s) = (13,7)-syncmers without downsampling ( $ε = 0$ ). The three graphs below label the taxa of four points, so the reader can compare the results in the graphs for the four taxa.

Figure 5 plots the syncmer ANI (i.e., 1 − $\hat{θ}$ from Eq (7)) against the BLAST ANI for each of the 25 mitochondria in the SI. The points yield an unweighted correlation coefficient r = 0.930, with r² = 0.866. The error bars show that much of the deviation from the diagonal line reflects syncmer sampling error. In Figure 5, the four labeled points belong to taxa that yield statistical significance in the hypothesis test of Figure 6.

FIG. 5.

ANI from syncmer counts vs. ANI calculated by BLAST. This figure plots the syncmer ANI [1 − $\hat{θ}$ from Eq (7)] against the BLAST ANI for each of the 25 mitochondria in the SI. On the upper right at (1.0, 1.0) is the reference mitochondrial genome sequence A, Homo sapiens neanderthalensis. Mutation increases away from the reference toward the lower left. Without error bars, the deviations from the diagonal line Y = X might appear disconcerting, particularly for Pelagia noctiluca, but the error bars on the syncmer ANI, the CIs from Eq (6), widen as mutation increases. The widening improves the consistency of the two measures of ANI, but discrepancies remain. Unlike the algorithm producing the BLAST ANI, however, the algorithm producing the syncmer ANI has a probabilistic interpretation given in the text. ANI, average nucleotide identity; BLAST, Basic Local Alignment Search Tool.

FIG. 6.

p-Value for the reference length yielding the syncmer count vs. the BLAST ANI. This figure plots the p-value based on Eqs (3) and (4) on the Y-axis against the BLAST ANI on the X-axis for each of the 25 mitochondria in the SI. The BLAST ANI facilitates the reader’s ability to correlate points between Figures 5 and 6. The p-value tests whether the syncmer count from the corresponding mitochondrial sequence B comes from the same mitochondrial length as the reference mitochondrial sequence A under the probability models of the text. It therefore can test for the presence of insertions or deletions in passing from the reference to the query sequence.

Figure 6 labels four mitochondrial genomes. The corresponding p-values show that the stated probability models likely require insertions or deletions in the reference mitochondrial sequence A from Homo sapiens neanderthalensis to produce the syncmer counts in the query sequences B.

In Figure 7 and in this paragraph, all lengths are in nucleotides. The points yield an unweighted correlation coefficient r = 0.760, with r² = 0.578. In Figure 7, most points lie close to the diagonal dotted line Y = X, indicating accurate length estimates, with the point (16916, 16247) farthest below it (and corresponding to C. porosus). To explain the anomaly, consider the multiset of syncmers from a mitochondrial sequence (multiset, because some syncmer sequences may appear several times within the mitochondrial sequence). The Excel file “mitochondria.csv” in the GitHub repository shows that all 25 mitochondria have 21 or fewer replicate syncmers, except C. porosus (44), E. chrysolophus (41), Ornithorhynchus anatinus (51), and T. annularis (64). The C. porosus mitochondrion is the shortest of the four exceptions (16916 compared with 17059, 17019, and 17511, respectively). It has a repetitive, simple sequence of about 500 nt near its end (see Supplementary Section S4). The C. porosus anomaly emphasizes the importance of syncmer uniqueness to sequence matching.

FIG. 7.

Syncmer mitochondrial length estimate with 95% CI vs. actual mitochondrial length. In this figure, all lengths are in nucleotides. The Y-axis indicates the mitochondrial length estimates along with their 95% CIs from Eqs (2)–(4) and the syncmer counts. The X-axis indicates the assembled mitochondrial length, usually unavailable in applications. The plot also displays a horizontal line corresponding to the length L + k − 1 = 16565 of the mitochondrial reference, Homo sapiens neanderthalensis.

4. DISCUSSION

Throughout, the present article assumes a simple mutation model excluding insertions and deletions (Blanca et al., 2022). It then presents CLTs for syncmers in a sequence and for syncmers conserved across both reference and query sequences. On one hand, the conserved syncmer CLT quantifies a phylogenetic distance with sampling errors as a mutation probability corresponding to an ANI. The conserved syncmer CLT provides sampling errors for the Mash containment index (Ondov et al., 2019) for syncmers. The conserved syncmer CLT may therefore provide a model for estimating sampling errors for other Mash containment indexes.

In anticipation of actual code usage, Section 3 substitutes the sequence length L + k − 1 for the k-mer count L in both input and output; if k is only a few percent of L + k − 1, the conflation is harmless. Section 3 showed that in specific cases, the syncmer CLT for L provides reasonable approximations if L + k − 1 is more than about 10⁴. In general, the syncmer CLT for length from Eq (3) also likely requires a syncmer count ${| S (B) |}_{1}$ of at least about 20.

On one hand, if the mutated sequence B is represented by unassembled reads, ideally the reads should cover B without gaps, particularly if the syncmer CLT for L is to be applied. Otherwise, if g > 0 counts the gaps, the k-mer count L could differ noticeably from the covered length $L + (g + 1) (k - 1)$. In contrast, we expect the conserved syncmer CLT for the mutation probability θ from Eq (5) to be relatively robust against gapped coverage because it compares reference and mutated sequences with syncmer counts ${| S (A) \cap S (B) |}_{1}$ and ${| S (B) |}_{1}$, and the k-mer count L appears only in the denominator under a square root.
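The covered length follows from counting k-mers segment by segment. As a short derivation under an added assumption, suppose the g gaps split the coverage of B into g + 1 contiguous segments of lengths $\ell_{1}, \ldots, \ell_{g + 1}$ (notation introduced here for illustration), each at least k long. A segment of length $\ell_{i}$ contains $\ell_{i} - k + 1$ k-mers, so

$$L = \sum_{i = 1}^{g + 1} (\ell_{i} - k + 1) = \sum_{i = 1}^{g + 1} \ell_{i} - (g + 1) (k - 1) \quad \Longrightarrow \quad \sum_{i = 1}^{g + 1} \ell_{i} = L + (g + 1) (k - 1) .$$

With g = 0, the expression reduces to the gap-free length L + k − 1.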

The conserved syncmer CLT for the mutation probability θ from Eq (5) likely requires a conserved syncmer count ${| S (A) \cap S (B) |}_{1}$ of at least about 20. The restriction on conserved syncmers can readily be understood, as follows. If a k-mer is conserved, then all its k letters must be conserved (probability ${\bar{θ}}^{k}$ ). Thus, even modest mutation probabilities $θ = 1 - \bar{θ}$ can mutate nearly all k-mers, making the near-zero conserved k-mer count relatively insensitive to changes in θ.

On the other hand, for complete k-mer sketches, which contain every k-mer within a sequence, indel restrictions in the simple mutation model can be relaxed by observing that (k − 1)-mers are nested inside of k-mers (Röhling et al., 2020). The nesting property permits edge effects to estimate indel counts. Like k-mers, mid-open syncmers also have a nesting property. For $k - s + 1 > 3$ , upon removal of one of the end-nucleotides, a mid-open (k,s)-syncmer becomes a mid-open (k−1,s)-syncmer. Techniques that apply the nesting property to complete k-mer sketches may therefore extend to mid-open syncmer sketches to relax restrictions on indels.

Simulations in Section 3 suggest that some qualitative constraints must be satisfied before applying our CLTs. The syncmer CLT for the k-mer count L from Eq (3) requires a syncmer count ${| S (A) |}_{1}$ of at least about 20. Moreover, as explained above, the conserved syncmer CLT for the mutation probability θ from Eq (5) requires a conserved syncmer count ${| S (A) \cap S (B) |}_{1}$ of at least about 20.

Our CLTs require that after postprocessing of raw sequencer output, the syncmer multiset $S (B)$ for the mutated sequence B (imagined after read assembly, if necessary) contain relatively few replicates (k-mer sequences x with multiplicities $m_{B} (x) > 1$ ). To avoid replicates: (1) the k-mers must be long enough to ensure uniqueness when matched; and (2) the k- and s-mer hashing functions $o_{k}$ and $o_{s}$ need to avoid mapping repetitive sequences onto syncmers. Thus, a hashing function $o_{s}$ can and should avoid uninteresting (repetitive) sequences, for example, by mapping uninteresting s-mers onto large values.
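Both requirements can be illustrated in a few lines. The sketch below is hypothetical in every detail: it counts replicates as occurrences beyond the first of each distinct syncmer (one possible reading of multiplicity greater than 1), it treats an s-mer with at most two distinct letters as uninteresting, and it uses a BLAKE2b digest as a stand-in for the s-mer hash.

```python
from collections import Counter
import hashlib

MAX_ORDER = 2**64  # strictly larger than any 8-byte digest value


def replicate_count(syncmers) -> int:
    """Occurrences beyond the first of each distinct syncmer sequence in the multiset."""
    counts = Counter(syncmers)
    return sum(m - 1 for m in counts.values() if m > 1)


def penalized_order(smer: str) -> int:
    """Order value for an s-mer that maps low-complexity s-mers onto a large value."""
    if len(set(smer)) <= 2:  # assumed definition of an uninteresting s-mer
        return MAX_ORDER
    digest = hashlib.blake2b(smer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


sketch = ["ACGTACGTTGCAA", "ACGTACGTTGCAA", "TTGACCATGGCAT", "ACGTACGTTGCAA"]
print(replicate_count(sketch))
print(penalized_order("AAAAAAA") == MAX_ORDER, penalized_order("ACGTTGC") < MAX_ORDER)
```

Because a penalized s-mer can never be the strict minimum when any ordinary s-mer shares its window, k-mers centered on simple repeats are less likely to be selected as syncmers.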

In Section 3, the inaccuracy of the length estimated for the C. porosus mitochondrion likely reflects simple repeats. The SI shows that for mid-open (k, s) = (13,7)-syncmers, only four taxa including C. porosus contained more than 20 replicate syncmers. Thus, the syncmer CLT for L can detect length discrepancies from syncmer counts alone, without assembling reads, and flag them for further investigation. Outside the four taxa, k-mers were typically long enough to provide syncmer uniqueness. The repeat program tantan (version 40) identified a repetitive simple sequence at the end of the anomalous C. porosus mitochondrion as a possible source of replicate syncmers.

Our analysis of overlapping syncmers neglects s-minimizer ties, so it implicitly assumes long s-mers (e.g., see for contrast s = 2 in Fig. 2). In addition, our CLTs provide continuous approximations to discrete distributions. Thus, they likely require s-minimizer offsets to vary almost continuously within k-mers, that is, they require each k-mer to contain many s-mers (e.g., see for contrast L = 10⁵, k = 10, and s = 6 in Fig. 2).

In summary, by necessity our CLTs place restrictions on syncmers. The main article above states several of these restrictions qualitatively, but the SI sharpens some of them into quantitative restrictions.

The present article aimed to extend previous CLTs for the complete and MinHash sketches (Blanca et al., 2022) for use with other submer sketches (Edgar, 2021). It focused on mid-open syncmers, mostly because of their optimality properties (Shaw and Yu, 2022), but its results hold generally, for parametrized syncmers with downsampling. Some techniques here likely also extend to other submers and their short-range correlations.

The SI proves our CLTs by analyzing syncmer overlaps under a probability model for random hashes. Even without foundations from probability theory, the techniques could likely be extended to other concepts like universal hitting sets, polar sets, etc., by analyzing the empirical overlaps between submers. Regardless, submers based on random hashing functions confer an additional layer of randomness on probability models of DNA. Sequence probability models often cannot capture unknown correlations in DNA, so the extra layer of randomization from hashing functions may improve the accuracy of the probability models.

Footnotes

ACKNOWLEDGMENT

The authors thank Jim Shaw and Yun William Yu for useful conversations.

AUTHORS’ CONTRIBUTIONS

J.L.S.: Conceptualization, formal analysis, methodology, software, supervision, visualization, and writing–review and editing; P.D.: Software, validation, visualization, and writing–original draft; Y.C.: Investigation, methodology, software, and validation; M.F.: Methodology, supervision, and writing–review and editing.

DATA AND SOFTWARE SHARING

The package of hypothesis tests and confidence intervals can be found at the (abbreviated) URL .

AUTHOR DISCLOSURE STATEMENT

The authors have received no assistance for their work from outside their academic institutions and have no patents or copyrights relevant to the work in the article.

FUNDING INFORMATION

This research was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.

SUPPLEMENTARY MATERIAL

References

1. Blanca, Harris, Koslicki, et al. The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches. J Comput Biol, 2022; 29(2):155–168.

2. Corder, Foreman. Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach. Wiley: New York; 2011.

3. Dutta, Pellow, Shamir. Parameterized syncmer schemes improve long-read mapping. PLoS Comput Biol, 2022; 18(10):e1010638; doi: 10.1371/journal.pcbi.1010638

4. Edgar. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ, 2021; 9:e10805.

5. Edgar, Taylor, Lin, et al. Petabase-scale sequence alignment catalyses viral discovery. Nature, 2022; 602(7895):142–147; doi: 10.1038/s41586-021-04332-2

6. Fisher. Statistical Methods for Research Workers. Oliver & Boyd: Edinburgh; 1925.

7. Frith, Noé, Kucherov, et al. Minimally-overlapping words for sequence similarity search. Bioinformatics, 2021; 36(22–23):5344–5350.

8. Harris, Medvedev. Improved representation of sequence bloom trees. Bioinformatics, 2020; 36(3):721–727.

9. Hoeffding, Robbins. The central limit theorem for dependent random variables. In: The Collected Works of Wassily Hoeffding. (Fisher, Sen. eds.) Springer New York: New York, NY; 1994; pp. 205–213.

10. Li. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 2018; 34(18):3094–3100; doi: 10.1093/bioinformatics/bty191

11. Manber. Finding similar files in a large file system. In: Winter USENIX Technical Conference. Usenix Winter: San Francisco CA USA, 1994; pp. 1–10.

12. Morgenstern, Zhu, Horwege, et al. Estimating evolutionary distances from spaced-word matches. In: 14th International Workshop, WABI (Brown, Morgenstern eds.) Springer-Verlag: 2014; pp. 161–173.

13. Morgenstern, Zhu, Horwege, et al. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol Biol, 2015; 10:5; doi: 10.1186/s13015-015-0032-x

14. Ondov, Starrett, Sappington, et al. Mash Screen: High-throughput sequence containment estimation for genome discovery. Genome Biol, 2019; 20(1):232; doi: 10.1186/s13059-019-1841-x

15. Ondov, Treangen, Melsted, et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biol, 2016; 17(1):132.

16. Orenstein, Pellow, Marçais, et al. Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput Biol, 2017; 13(10):e1005777.

17. Roberts, Hayes, Hunt, et al. Reducing storage requirements for biological sequence comparison. Bioinformatics, 2004; 20(18):3363–3369.

18. Röhling, Linne, Schellhorn, et al. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One, 2020; 15(2):e0228070.

19. Sarmashghi, Bohmann, Gilbert, et al. Skmer: Assembly-free and alignment-free sample identification using genome skims. Genome Biol, 2019; 20(1):34; doi: 10.1186/s13059-019-1632-4

20. Schleimer, Wilkerson, Aiken. Winnowing: Local algorithms for document fingerprinting. In: SIGMOD 2003. ACM: San Diego CA; 2003; pp. 76–85.

21. Schmidt, Hildebrandt. Next-generation sequencing: Big data meets high performance computing. Drug Discov Today, 2017; 22(4):712–717; doi: 10.1016/j.drudis.2017.01.014

22. Shaw, Yu. Theory of local k-mer selection with applications to long-read alignment. Bioinformatics, 2022; 38(20):4659–4669; doi: 10.1093/bioinformatics/btab790

23. Solomon, Kingsford. Fast search of thousands of short-read sequencing experiments. Nat Biotechnol, 2016; 34(3):300–302.

24. Sommer, Delcher, Salzberg, et al. Minimus: A fast, lightweight genome assembler. BMC Bioinformatics, 2007; 8(1):64; doi: 10.1186/1471-2105-8-64

25. Song, Ren, Reinert, et al. New developments of alignment-free sequence comparison: Measures, statistics and next-generation sequencing. Brief Bioinform, 2014; 15(3):343–353.

26. Wilson. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc, 1927; 22(158):209–212; doi: 10.2307/2276774

27. Wood, Lu, Langmead. Improved metagenomic analysis with Kraken 2. Genome Biol, 2019; 20(1):257.

28. Wood, Salzberg. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol, 2014; 15(3):R46.

29. Ye, Ma, Cannon, et al. Exploiting sparseness in de novo genome assembly. BMC Bioinformatics, 2012; 13(Suppl 6):S1; doi: 10.1186/1471-2105-13-s6-s1

30. Zheng, Kingsford, Marçais. Sequence-specific minimizers via polar sets. Bioinformatics, 2021; 37(Suppl_1):i187–i195.


The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches
