Sage Journals: Discover world-class research

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.

Get full access to this article

View all access options for this article.

References

Bankevich

, Nurk

, Antipov

, et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477.

Broder

A.Z.

1997. On the resemblance and containment of documents, pp. 21–29. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, Salerno, Italy.

Brown

C.T.

, Olm

M.R.

, Thomas

B.C.

, and Banfield

J.F.

2016. Measurement of bacterial replication rates in microbial communities. Nat. Biotechnol. 34, 1256.

Brown

L.D.

, Cai

T.T.

, and DasGupta

2001. Interval estimation for a binomial proportion. Stat. Sci. 16, 101–117.

Burden

C.J.

, Leopardi

, and Forêt

2014. The distribution of word matches between markovian sequences with periodic boundary conditions. J. Comput. Biol. 21, 41–63.

Casella

, and Berger

R.L.

2002. Statistical Inference, Vol. 2. Duxbury Pacific Grove, CA.

Denti

, Previtali

, Bernardini

, et al. 2019. MALVA: Genotyping by Mapping-free ALlele detection of known VAriants. iScience, 18, 20–27.

Fan

, Ives

A.R.

, Surget-Groba

, and Cannon

C.H.

2015. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics, 16, 522.

Gusfield

1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.

10.

Harris

R.S.

, and Medvedev

2020. Improved representation of sequence bloom trees. Bioinformatics, 36, 721–727.

11.

Haubold

, Pfaffelhuber

, Domazet-Loso

, and Wiehe

2009. Estimating mutation distances from unaligned genomes. J. Comput. Biol. 16, 1487–1500.

12.

Hoeffding

, Robbins

, et al. 1948. The central limit theorem for dependent random variables. Duke Math. J. 15, 773–780.

13.

Jain

, Dilthey

, Koren

, et al. 2017. A fast approximate algorithm for mapping long reads to large reference databases, pp. 66–81. In: International Conference on Research in Computational Molecular Biology. Ed: S. Cenk Sahinalp. Springer, Hongkong.

14.

Lander

E.S.

, and Waterman

M.S.

1988. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics, 2, 231–239.

15.

2018. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34, 3094–3100.

16.

Y.Y.

, Tang

, Ren

, et al. 2017. Cafe: Accelerated alignment-free sequence analysis. Nucleic Acids Res. 45(W1), W554–W559.

17.

Miao

, and Gastwirth

J.L.

2004. The effect of dependence on confidence intervals for a population proportion. Am. Stat. 58, 124–130.

18.

Miclotte

, Heydari

, Demeester

, et al. 2016. Jabba: Hybrid error correction for long sequencing reads. Algorithms Mol. Biol. 11, 1–12.

19.

Morgenstern

, Zhu

, Horwege

, and Leimeister

C.A.

2015. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol. Biol. 10, 5.

20.

Ondov

B.D.

, Starrett

G.J.

, Sappington

, et al. 2019. Mash Screen: High-throughput sequence containment estimation for genome discovery. Genome Biol. 20, 232.

21.

Ondov

B.D.

, Treangen

T.J.

, Melsted

, et al. 2016. Mash: Fast genome and metagenome distance estimation using minhash. Genome Biol. 17, 132.

22.

Reinert

, Chew

, Sun

, and Waterman

M.S.

2009. Alignment-free sequence comparison (i): Statistics and power. J. Comput. Biol. 16, 1615–1634.

23.

Röhling

, Linne

, Schellhorn

, et al. 2020. The number of k-mer matches between two dna sequences as a function of k and applications to estimate phylogenetic distances. PLos One, 15, e0228070.

24.

Ross

2011. Fundamentals of Stein's method. Probab. Surv. 8, 210–293.

25.

Salmela

, Walve

, Rivals

, and Ukkonen

2017. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics, 33, 799–806.

26.

Sarmashghi

, Bohmann

, Gilbert

M.T.P.

, et al. 2019. Skmer: Assembly-free and alignment-free sample identification using genome skims. Genome Biol. 20, 1–20.

27.

Schwengers

, Hain

, Chakraborty

, and Goesmann

2019. Referenceseeker: Rapid determination of appropriate reference genomes. BioRxiv. Vol. 863621.

28.

Solomon

, and Kingsford

2016. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302.

29.

Song

, Ren

, Reinert

, et al. 2014. New developments of alignment-free sequence comparison: Measures, statistics and next-generation sequencing. Briefings Bioinf. 15, 343–353.

30.

Standage

D.S.

, Brown

C.T.

, and Hormozdiari

2019. Kevlar: A mapping-free framework for accurate discovery of de novo variants. bioRxiv. Vol. 549154.

31.

Sun

, and Medvedev

2018. Toward fast and accurate snp genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics, 35, 415–420.

32.

Tang

, Liu

, Zhang

, et al. 2019. Sketch distance-based clustering of chromosomes for large genome database compression. BMC Genomics, 20, 1–9.

33.

Wang

, and Au

K.F.

2020. Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads. Genome Biol. 21, 14.

34.

Wasserman

2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, Berlin, Germany.

35.

Wilson

E.B.

1927. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.

36.

Wood

D.E.

, and Salzberg

S.L.

2014. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46.

37.

T.-J.

, Huang

Y.-H.

, and Li

L.-A.

2005. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics, 21, 4125–4132.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

2.87 MB

0.00 MB

The Statistics of k -mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Abstract

Get full access to this article

References

Supplementary Material