The abstraction augmented Markov model (AAMM) is an extension of the Markov model that can be used to analyze genetic sequences. It is built from the frequencies of all possible consecutive words of the same length (p-mers). This article reviews the theory behind AAMM and applies it to metagenomic classification.
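To make the notion of p-mer frequencies concrete, here is a minimal sketch in Python. It is an illustration only, not the article's implementation: the function name `pmer_frequencies`, the DNA alphabet `ACGT`, and the reading of "consecutive words" as an overlapping sliding window are all assumptions.

```python
from collections import Counter
from itertools import product


def pmer_frequencies(sequence, p, alphabet="ACGT"):
    """Relative frequencies of all possible p-mers in a sequence.

    Hypothetical helper: assumes overlapping windows of length p and
    ignores windows containing letters outside the alphabet.
    """
    sequence = sequence.upper()
    # Count every window of length p, sliding one position at a time.
    counts = Counter(sequence[i:i + p] for i in range(len(sequence) - p + 1))
    # Enumerate all |alphabet|^p possible words so that unseen words get 0.
    words = ["".join(w) for w in product(alphabet, repeat=p)]
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}


freqs = pmer_frequencies("ACGTACGT", 2)
```

For a DNA alphabet the result has 4^p entries, so the table grows quickly with p; this is one reason practical methods keep p small.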