
Abstract

This article assesses several k-mer counting tools in order to give bioinformatics researchers a reference framework for identifying the computational requirements, parallelization, advantages, disadvantages, and bottlenecks of the algorithm behind each tool. The k-mer counters evaluated were BFCounter, DSK, Jellyfish, KAnalyze, KHMer, KMC2, MSPKmerCounter, Tallymer, and Turtle. The measured parameters were RAM usage, processing time, parallelization, and disk read and write access. The dataset consisted of 36,504,800 reads corresponding to human chromosome 14. The assessment was performed for two k-mer lengths: 31 and 55. The results were as follows. Tools based on pure Bloom filters and tools based on disk partitioning used less RAM. The tools with the shortest execution times were those that used disk partitioning. The tools that achieved the greatest parallelization were those that used disk partitioning, lock-free hash tables, or multiple hash tables.
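For orientation, the task that all of these tools solve can be sketched as a naive in-memory hash-table counter. This is an illustrative baseline only, not the algorithm of any evaluated tool: the function name `count_kmers` and the toy reads are assumptions, and the sketch ignores reverse complements, quality filtering, and parallelization.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across an iterable of reads.

    Hypothetical helper for illustration; not taken from any evaluated tool.
    """
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers;
        # reads shorter than k contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input (assumed for the example); real datasets hold millions of reads.
reads = ["ACGTACGT", "CGTACG"]
counts = count_kmers(reads, k=4)
print(counts["ACGT"])  # 2
print(counts["CGTA"])  # 2
```

Because this baseline stores every distinct k-mer as a key in one table, its memory use grows with the number of distinct k-mers. Approaches such as Bloom filters and disk partitioning, named in the abstract, are aimed at reducing that RAM footprint.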




Computational Performance Assessment of k-mer Counting Algorithms
