Stochastic models of sequence evolution including insertion-deletion events

Abstract

Comparison of sequences that have descended from a common ancestor, based on an explicit stochastic model of substitutions, insertions and deletions, has risen to prominence in the last decade. Making statements about the positions of insertions and deletions (indels) is central to sequence and genome analysis and is called alignment. This statistical approach is harder, both conceptually and computationally, than competing approaches that choose an alignment according to some optimality criterion, but it has major practical advantages for testing evolutionary hypotheses and estimating parameters. Basic dynamic programming approaches allow the analysis of up to 4-5 sequences, and MCMC techniques can bring this to about 10-15 sequences. Beyond this, different or heuristic approaches must be used. Besides the computational challenges, increasing realism in the underlying models is presently being addressed. A recent development that has been especially fruitful is combining statistical alignment with the problem of sequence annotation, that is, making statements about the function of each nucleotide or amino acid. So far gene finding, protein secondary structure prediction and regulatory signal detection have been tackled within this framework. Much progress can be reported, but clearly major challenges remain if this approach is to be central in the analysis of large incoming sequence data sets.
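The contrast between a statistical treatment and an optimality criterion can be illustrated with a minimal sketch. It is not one of the evolutionary indel models the abstract refers to. It assumes a generic three-state pair hidden Markov model (match, gap in one sequence, gap in the other) with no explicit end state. The function name `pair_hmm` and the parameters `delta` (gap opening), `eps` (gap extension) and `match` (probability that an aligned pair is identical) are hypothetical values chosen only for illustration. The same dynamic programming recursion yields either the probability summed over all alignments or the probability of the single best alignment, depending on whether terms are combined with `sum` or `max`.

```python
def pair_hmm(x, y, delta=0.1, eps=0.3, match=0.9, combine=sum):
    """Toy pair HMM over DNA (illustrative parameters, not fitted values).

    combine=sum: probability of x and y summed over all alignments.
    combine=max: probability of the single most probable alignment.
    """
    alphabet = "ACGT"
    q = 1.0 / len(alphabet)  # uniform emission for an unaligned residue

    def p(a, b):
        # Joint emission probability of an aligned pair (sums to 1 over pairs).
        return q * (match if a == b else (1 - match) / (len(alphabet) - 1))

    n, m = len(x), len(y)
    # fS[i][j]: weight of alignments of x[:i] and y[:j] that end in state S.
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]  # x[i-1] aligned to y[j-1]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]  # x[i-1] aligned to a gap
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]  # y[j-1] aligned to a gap
    fM[0][0] = 1.0  # the start behaves like the match state

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * combine([
                    (1 - 2 * delta) * fM[i - 1][j - 1],
                    (1 - eps) * fX[i - 1][j - 1],
                    (1 - eps) * fY[i - 1][j - 1],
                ])
            if i > 0:
                fX[i][j] = q * combine([delta * fM[i - 1][j], eps * fX[i - 1][j]])
            if j > 0:
                fY[i][j] = q * combine([delta * fM[i][j - 1], eps * fY[i][j - 1]])

    return combine([fM[n][m], fX[n][m], fY[n][m]])


total = pair_hmm("ACGT", "AGT")              # sums over every alignment
best = pair_hmm("ACGT", "AGT", combine=max)  # best single alignment only
```

Because `total` accounts for every alignment rather than one optimal alignment, it can serve as a likelihood for parameter estimation or hypothesis testing. The table filled here has one cell per pair of prefixes, so the analogous table for k sequences grows as the product of their lengths, which is in line with the abstract's limit of a few sequences for basic dynamic programming.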

