Metagenomic Classification Using an Abstraction Augmented Markov Model

Abstract

The abstraction augmented Markov model (AAMM) is an extension of a Markov model that can be used for the analysis of genetic sequences. It is built from the frequencies of all possible consecutive words of the same length (p-mers). This article reviews the theory behind the AAMM and applies it to metagenomic classification.
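As a rough illustration of the p-mer frequencies mentioned above, the following Python sketch counts overlapping words of length p in a DNA sequence. It is not the AAMM itself, only the word-counting step, under several assumptions: the function name `pmer_frequencies`, the alphabet `ACGT`, the one-position sliding window, and the choice to skip windows containing other symbols are all illustrative and not taken from the article.

```python
from collections import Counter
from itertools import product


def pmer_frequencies(sequence, p, alphabet="ACGT"):
    """Return relative frequencies of all overlapping p-mers in `sequence`.

    Hypothetical helper for illustration; not the article's implementation.
    """
    sequence = sequence.upper()
    # Slide a window of width p one position at a time (overlapping words).
    windows = [sequence[i:i + p] for i in range(len(sequence) - p + 1)]
    # Assumption: windows with symbols outside the alphabet (e.g. 'N') are skipped.
    valid = [w for w in windows if all(ch in alphabet for ch in w)]
    counts = Counter(valid)
    total = len(valid)
    # Enumerate every possible p-mer so that unseen words get frequency 0.
    return {
        "".join(word): (counts["".join(word)] / total if total else 0.0)
        for word in product(alphabet, repeat=p)
    }


freqs = pmer_frequencies("ACGTACGTAC", p=2)
```

Here `freqs` has one entry for each of the 16 possible 2-mers, and the values sum to 1 whenever the sequence contains at least one valid window.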

