Abstract
Universal hitting sets (UHS) are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between UHS and minimizer schemes, where minimizer schemes with low density (i.e., efficient schemes) correspond to UHS of small size. Local schemes are a generalization of minimizer schemes that can be used as replacement for minimizer scheme with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a UHS. We give bounds for the remaining path length of the Mykkeltveit UHS. In addition, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound.
1. Introduction
We study the problem of finding Universal Hitting Sets (UHS) (Orenstein et al., 2016). A UHS is a set of words, each of length k, such that every long enough string (say of length L or longer) contains as a substring element from the set. We call such a set a UHS for parameters k and L. They are sets of unavoidable words, that is, words that must be contained in any long strings, and we are interested in the relationship between the size of these sets and the length L.
More precisely, we say that a k-mer a (a string of length k) hits a string S if a appears as a substring of S. A set A of k-mers hits S if at least one k-mer of A hits S. A UHS for length L is a set of k-mers that hits every string of length L. Equivalently, the remaining path length of a universal set is the length of the longest string that is not hit by the set (
The study of UHS is motivated, in part, by the link between UHS and the common method of minimizers (Schleimer et al., 2003; Roberts et al., 2004a,b). The minimizer method is a way to sample a string for representative k-mers in a deterministic way by breaking a string into windows, each window containing w k-mers, and selecting in each window a particular k-mer (the “minimum k-mer,” as defined by a preset order on the k-mers). This method is used in many bioinformatic software programs (Ye et al., 2012; Grabowski and Raniszewski, 2013; Chikhi et al., 2015; Deorowicz et al., 2015; Jain et al., 2017) to reduce the amount of computation and improve run time (see Marçais et al., 2019 for usage examples). The minimizer method is a family of methods parameterized by the order on the k-mers used to find the minimum. The density is defined as the expected number of sampled k-mers per unit length of sequence. Depending on the order used, the density varies.
In general, a lower density (i.e., fewer sampled k-mers) leads to greater computational improvements and is therefore desirable. For example, a read aligner such as Minimap2 (Li and Birol, 2018) stores all the locations of minimizers in the reference sequence in a database. It then finds all the minimizers in a read and searches in the database for these minimizers. The locations of these minimizers are used as seeds for the alignment. Using a minimizer scheme with a reduced density leads to a smaller database and fewer locations to consider, hence an increased efficiency, while preserving the accuracy.
There is a two-way correspondence between minimizer methods and UHS: each minimizer method has a corresponding UHS, and a UHS defines a family of compatible minimizer methods (Marçais et al., 2017, 2018). This correspondence also links the remaining path length of a UHS and the window size of a compatible minimizer scheme: the remaining path length of the UHS is upper bounded by the number of bases in each window in the minimizer scheme (
Moreover, the relative size of the UHS, defined as the size of UHS over the number of possible k-mers, provides an upper bound on the density of the corresponding minimizer methods: the density is no more than the relative size of the UHS. Precisely,
Local schemes (Mykkeltveit, 1972) and forward schemes are generalizations of minimizer schemes. These extensions are of interest because they can be used in place of minimizer schemes while sampling k-mers with lower density. In particular, minimizer schemes cannot have density close to the theoretical lower bound of
The previously known link between minimizer schemes and UHS relied on the definition of an ordering between k-mers, and therefore is not valid for local and forward schemes that are not based on any ordering. Nevertheless, UHS play a central role in understanding the density of local and forward schemes.
Our first contribution is to describe the connection between UHS, local and forward schemes. More precisely, there are two connections: first, between the density of the schemes and the relative size of the UHS, and second, between the window size w of the scheme and the remaining path length of the UHS (i.e., the maximum length L of a string that does not contain a word from the UHS). This motivates our study of the relationship between the size of a UHS U and the remaining path length of U.
There is a rich literature on unavoidable word sets (Lothaire, 2002). The setting for UHS is slightly different for two reasons. First, we impose that all the words in the set U have the same length k, as a k-mer is a natural unit in bioinformatic applications. Second, the set U must hit any string of a given finite length L, rather than being unavoidable only by infinitely long strings.
Mykkeltveit (1972) answered the question of what is the size of a minimum unavoidable set with k-mers by giving an explicit construction for such a set. The k-mers in the Mykkeltveit set are guaranteed to be present in any infinitely long sequence, and the size of the Mykkeltveit set is minimum in the sense that for any set
The DOCKS (Orenstein et al., 2016) and ReMuVal (DeBlasio et al., 2019) algorithms are heuristics to generate unavoidable sets for parameters k and L. Both of these algorithms use the Mykkeltveit set as a starting point. In many practical cases, the longest sequence that does not contain any k-mer from the Mykkeltveit set is much larger than the parameter L of interest (which for a compatible minimizer scheme corresponds to the window length). Therefore, the two heuristics extend the Mykkeltveit set to cover every L-long sequence. These greedy heuristics do not provide any guarantee on the size of the unavoidable set generated compared with the theoretical minimum size and are only computationally tractable for limited ranges of k and L.
Our second contribution is to give upper and lower bounds on the remaining path length of the Mykkeltveit sets. These are the first bounds on the remaining path length for minimum size sets of unavoidable k-mers.
Defining local or forward schemes with a density of
Understanding the properties of UHS and their many interactions with selection schemes (minimizer and forward and local schemes) is a crucial step toward designing schemes with lower density and improving the many algorithms using these schemes. In Section 2, we give an overview of the results, and in Section 3, we give detailed proofs. Further research directions are discussed in Section 4.
2. Results
2.1. Notation
2.1.1. Universal hitting sets
Consider a finite alphabet
2.1.2. de Bruijn graphs
Many questions regarding strings have an equivalent formulation with graph terminology using de Bruijn graphs. The de Bruijn graph
There is a one-to-one correspondence between strings and paths in
A de Bruijn sequence is a particular sequence of length
2.1.3. Selection schemes
A local scheme (Schleimer et al., 2003) is a method to select positions in a string. A local scheme is parameterized by a selection function f. It works by looking at every w-mer of the input sequence S:
A forward scheme is a local scheme with a selection function such that the selected positions form a nondecreasing sequence. That is, if
A minimizer scheme is a scheme where the selection function takes in the sequence of w consecutive k-mers and returns the “minimum” k-mer in the window (hence the name minimizers). The minimum is defined by a predefined order on the k-mers (e.g., lexicographic order) and the selection function is
See Figure 1 for examples of all three schemes. The local scheme concept is the most general as it imposes no constraint on the selection function, while a forward scheme must select positions in a nondecreasing way. A minimizer scheme is the least general and also selects positions in a nondecreasing way.

Local and forward schemes were originally defined with a function defined on a window of w k-mers,
There are multiple reasons to consider selection schemes. First, they are slightly simpler as they have only one parameter, namely the window length w. Second, in our analysis, we consider the case where w is asymptotically large, therefore
2.1.4. Density
Because a local scheme on string S may pick the same location in two different windows, the number of selected positions is usually less than
2.2. Main results
The density of a local scheme is in the range
Unfortunately, this simple model properly accounts for the minimizer behavior only when k and w are small. For large k—that is,
This motivates the study of forward schemes and local schemes. It is known that there exist forward schemes with a density of
2.2.1. Connection between UHS and selection schemes
In the study of selection schemes, as for minimizer schemes, UHS play a central role. We describe the link between selection schemes and UHS, and show that the existence of a selection scheme with low density implies the existence of a UHS with a small relative size.
2.2.2. Almost-optimal relative size UHS for linear path length
Conversely, because of their link to forward and local selection schemes, we are interested in UHS with remaining path length
Consequently, the relative size of a UHS, which contains at least one element from each of these cycles, is lower bounded by
2.2.3. Remaining path length bounds for the Mykkeltveit sets
Mykkeltveit (1972) gave an explicit construction for a decycling set with exactly one element from each of the rotation cycles, and thereby proved a long-standing conjecture (Golomb, 2014) that the minimal size of decycling sets is equal to the necklace number. Under the UHS framework, it is natural to ask what the remaining path length for Mykkeltveit sets is. Given that the de Bruijn graph is Hamiltonian, there exist paths of length exponential in w: the Hamiltonian tours have σw vertices. Nevertheless, we show that the remaining path length for Mykkeltveit sets is upper and lower bounded by polynomials of w:
3. METHODS AND PROOFS
For length reasons, several parts of the proof are found in the Supplementary Material.
3.1. UHS from selection schemes
3.1.1. Contexts and densities of selection schemes
We derive another way of calculating densities of selection schemes based on the idea of contexts.
Recall a local scheme is defined as a function
For a location i in sequence S, the context at this location is defined as
The expected density of f is computed as the number of selected positions over the length of the sequence for a random sequence, as the sequence becomes infinitely long. For a sufficiently long random sequence (
3.1.2. UHS from selection schemes
The set
Proof. By contradiction, assume there is a path of length w in the de Bruijn graph of order
Since
This results is also a direct consequence of the definition of
When f is a forward scheme, to determine if a new location is picked in a window, looking back one window is sufficient. This is because if we do not pick a new location, we have to pick the same location as in the last window. This means the context with two w-mers, or as a (w + 1)-mer, is sufficient, and our other arguments involving contexts still hold. Combining the pieces, we prove the following theorem:
3.2. Forbidden word depathing set
3.2.1. Construction and path length
In this section, we construct a set that is
We assume that w is sufficiently large such that
Proof. Let
On the contrary, let
The number of w-mer satisfying clause 1 is
3.2.2. Number of w-mers not containing 0 d
We construct a finite state machine (FSM) that recognizes 0
d
as follows. The FSM consists of
Now, assume we feed a random w-mer to the FSM. The probability that the machine does not reach state “d” for the input w-mer is the relative size of the set of w-mer satisfying clause 2. Denote
Define
Starting with
3.2.3. Bounding
We start by deriving the characteristic polynomial
Proof. The characteristic polynomial of Ad satisfies the following recursive formula, obtained by expanding the determinant over the first column and using the linearity of the determinant:

For
.
Now we fix d and focus on the polynomial
Proof. We show fd has opposite signs on the lower and upper bound of this inequality for sufficiently large d.

For the last line, if
Proof. For the first part, we need to verify 
This verifies
Proof. Let 
This lemma implies that the relative size for the set
3.3. Construction of the Mykkeltveit sets
In this section, we construct the Mykkeltveit set
Intuitively, the position of a w-mer x is defined as the following center of mass. The w roots of unity form a circle in the complex plane, and a weight equal to the value of the base xi is set at the root
Define the successor function
We focus on a particular kind of cycle in the de Bruijn graph. A pure cycle in the de Bruijn graph, also known as conjugacy class, is the sequence of w-mers obtained by repeated rotation:
The embeddings from pure rotations satisfy a curious property:
Proof. By Definition 3 and the definition of successor function 
Note that for pure rotations
The range for

If every w-mer in the class embeds to the origin, pick an arbitrary one.
If there is one w-mer x in the class such that Re(P(x)) <0 and Im(p(x) = 0) (on the negative real axis), pick that one.
Otherwise, pick the unique w-mer x such that Im(p(x) < 0) and Im(P(R(x))) >0. Intuitively, this is the w-mer in the cycle right below the negative real axis.
This set breaks every pure cycle in the de Bruijn graph by its construction, with an interesting property:
Proof. It suffices to show that in the remaining de Bruijn graph after removing If we have Im(P(x)) <0, by clause 3 of Definition 4, If we have Im(P(x)) = 0 and Re(P(x)) <0, by clause 2 of Definition 4, we have If we have Im(P(x)) = Re(P(x)) = 0, we would have Im(P(y)) = Im(P(R(x))) = 0, a contradiction. If we have Im(P(x)) = 0 and Re(P(x)) >0, P(x) lies on positive half of the real axis, so rotating it clockwise by
3.4. Upper bounding the remaining path length in Mykkeltveit sets
In this section, we show that the remaining path after removing
3.4.1. From w-mers to embeddings
In this section, we formulate a relaxation that converts paths of w-mers to trajectories in a geometric space. Precisely, we model Sa in Lemma 7 as a rotation operating on a complex embedding with attached weights, where the weights restrict possible moves.
Formally, given a pair (z, t) where z is a complex number and t an integer, define the family of operations
We are now looking for the length of the longest path by repeated application of
3.4.2. Weight-in embedding and relaxation
The weight-in embedding maps the pair
The
Proof. By definition of weight-in embedding and the operation
In the complex plane, the rotation formula around center c and of angle θ is
Figure 2b shows the weight-in embedding of a de Bruijn graph. The set
Multiple pairs of
Lemma 8 naturally divides any path in the de Bruijn graph avoiding
Proof. The operation is a clockwise rotation where the rotation center is on the x-axis and the two points are on the non-negative half-plane. Necessarily, the real part increased, unless the point is on the fix point of the rotation [which is when
We further relax the problem by allowing rotations from any of the centers in What is the longest path
We now break the problem into smaller stages as the weight-in embedding pass through rotation centers, defined as
We now define the problem of finding the longest path, localized to one left subregion, as follows:
Again, note that this new definition is a purely geometric problem involving no w-mers and no weights
We prove this lemma in Supplementary Section S2.
3.4.3. Backtracking, heights, and local potentials
In this section, we prove
Proof. The key observation is if a rotation is not around the origin,
To see this, assume
Only
This lemma is sufficient to prove
3.5. Lower bounding the remaining path length in Mykkeltveit sets
We provide here a constructive proof of the existence of a
We need an alternative view of w-mers in this section, close to a shift register. Imagine a paper ring with w slots, labeled tag 0 to tag
Let
More precisely, the initial w-mer x is all ones but
Each round involves exactly w + 1 rotations since the last step is to an impure rotation S0 at tag

The correctness proof for the construction is presented in Supplementary Section S4 and the construction for odd w is presented in Supplementary Section S5.
4. Discussion
4.1. Relationship between UHS and selection schemes
Our construction of a UHS of relative size
Unfortunately, this construction does not apply for an arbitrary UHS. In general, given a UHS with relative size d and remaining path length w, it is still unknown how to construct a forward or a local scheme with density
We are thus interested in the following questions. Given a UHS U with relative size d, is it possible to create another UHS
4.2. Existence of “perfect” selection schemes
One of the goals in this research is to confirm or deny the existence of asymptotically “perfect” selection schemes with a density of
4.3. Asymptotic results and practical uses of minimizer schemes
This line of research places more focus on asymptotic densities of minimizers (and naturally, asymptotic densities of local schemes, and asymptotic relative proportion and path length of UHS). That is, we focus on characterization of these quantities in the limit where w and k, the window length (number of k-mers in a window) and the length of a k-mer, go to infinity. On the contrary, for current practices, the values of w and k are relatively small, with w below 100 and k below 30 for the vast majority of use cases. While our results are not immediately useful to analyze these practical scenarios as we do not attempt to determine the constants behind the big-O notation, we believe that further refinement of our approaches can close the gap between theory and practice, and make rigorous analyses possible for practical minimizer and/or local schemes.
4.4. Remaining path length of minimum decycling sets
There is more than one decycling set of minimum size (MDS) for given w. The Mykkeltveit set (Mykkeltveit, 1972) is one possible construction, and a construction based on very different ideas is given in Champarnaud et al. (2004). The number of MDSs is much larger than the two sets obtained by these two methods. Empirically, for small values of w, we can exhaustively search all the MDSs on the binary alphabet: for
While experiments suggest that the longest remaining path in a Mykkeltveit depathing set defined in the original article is around
Footnotes
Author Disclosure Statement
H.Z. declares no competing financial interests. C.K. is a cofounder of Ocean Genomics, Inc. G.M. is V.P. of software development at Ocean Genomics, Inc.
Funding Information
This work was partially supported, in part, by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through grant GBMF4554 to Carl Kingsford, by the U.S. National Science Foundation (CCF-1256087, CCF-1319998), and by the U.S. National Institutes of Health (R01GM122935).
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
