Abstract

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D₂ statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D₂ statistic under the null hypothesis of independent and identically distributed (i.i.d.) letters have been studied extensively, but no comprehensive study of the D₂ distribution for more biologically realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D₂ statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D₂ distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D₂ distribution from the human genome.
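As an illustration of the quantities named in the abstract, the following minimal Python sketch counts exact k-word matches between two sequences, optionally treating each sequence as a ring (periodic boundary conditions), and matches a Pólya-Aeppli distribution to a given mean and variance. The function names `kmer_counts`, `d2`, and `polya_aeppli_params` are hypothetical, and the (λ, θ) parametrization is the standard geometric-Poisson one (mean λ/(1−θ), variance λ(1+θ)/(1−θ)²), which may differ from the parametrization used in the article. The article's exact Markovian formulas for the mean and variance are not reproduced here.

```python
from collections import Counter


def kmer_counts(seq, k, periodic=True):
    """Count the k-words in seq.

    With periodic=True the sequence is treated as a ring, so there are
    len(seq) words; otherwise there are len(seq) - k + 1 words.
    """
    n = len(seq)
    if k < 1 or k > n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    if periodic:
        extended = seq + seq[:k - 1]  # wrap the first k-1 letters around
        n_starts = n
    else:
        extended = seq
        n_starts = n - k + 1
    return Counter(extended[i:i + k] for i in range(n_starts))


def d2(seq_a, seq_b, k, periodic=True):
    """Number of exact k-word matches between seq_a and seq_b.

    Every pair (position in seq_a, position in seq_b) whose k-words are
    identical contributes one match, so D2 is the sum over words of the
    product of the two word counts.
    """
    counts_a = kmer_counts(seq_a, k, periodic)
    counts_b = kmer_counts(seq_b, k, periodic)
    return sum(c * counts_b[w] for w, c in counts_a.items())


def polya_aeppli_params(mean, var):
    """Moment-match a Polya-Aeppli (geometric Poisson) distribution.

    Assumes mean = lam / (1 - theta) and
    var = lam * (1 + theta) / (1 - theta) ** 2, which requires var >= mean.
    """
    if mean <= 0 or var < mean:
        raise ValueError("need mean > 0 and var >= mean")
    theta = (var - mean) / (var + mean)
    lam = 2.0 * mean ** 2 / (var + mean)
    return lam, theta
```

For example, `d2("ACGT", "ACGT", 2)` counts the wrap-around word `TA` in both sequences, whereas `d2("ACGT", "ACGT", 2, periodic=False)` does not.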

The Distribution of Word Matches Between Markovian Sequences with Periodic Boundary Conditions
