Simple and Thorough Detection of Related Sequences with Position-Varying Probabilities of Substitutions, Insertions, and Deletions

Abstract

One way to understand biology is by finding genetic sequences that are related to each other. Often, a family of related sequences has position-varying probabilities of substitutions, insertions, and deletions: we can use these to find distantly related sequences. There are popular software tools to do this, which all have limitations. They either do not use all probability evidence (e.g., PSI-BLAST, MMseqs2) or have excessive complexity and minor biases (e.g., HMMER). This complexity inhibits fertile development of alternative tools.

This study describes a simplest reasonable way to find related sequences, making full use of position-varying probabilities. The algorithms likely use the fewest operations that such algorithms possibly could, so they are fast and simple. This has been implemented in prototype software named DUMMER (Dumb Uncomplicated Match ModelER). Its sensitivity and specificity are competitive with HMMER. It finds evidence that the human genome has many more relics of some ancient transposons, including LF-SINE, which was co-opted for various functions in common ancestors of all land vertebrates.
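To make "position-varying probabilities" concrete, here is a minimal sketch of a forward-style dynamic program that sums likelihood ratios over all global alignment paths between a profile and a sequence. It is not DUMMER's algorithm: the function name `forward_ratio`, the per-column keys (`emit`, `match`, `insert`, `delete`), the `start_insert` parameter, and the single-state gap model are all assumptions chosen for illustration.

```python
def forward_ratio(profile, background, seq, start_insert=0.0):
    """Sum likelihood ratios over all global alignment paths (illustrative).

    profile: list of columns; each column is a dict with hypothetical keys
      "emit"   -> dict mapping letter to its probability at this position
      "match"  -> weight for aligning this position to a sequence letter
      "delete" -> weight for skipping this position
      "insert" -> weight for an extra sequence letter after this position
    background: dict mapping letter to its background probability
    start_insert: weight for an extra letter before the first position
    """
    m, n = len(profile), len(seq)
    # F[i][j]: summed ratio for profile prefix i versus sequence prefix j
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                col = profile[i - 1]
                x = seq[j - 1]
                # substitution: position-specific emission versus background
                total += F[i - 1][j - 1] * col["match"] * col["emit"][x] / background[x]
            if i > 0:
                # deletion: profile position i has no sequence letter
                total += F[i - 1][j] * profile[i - 1]["delete"]
            if j > 0:
                # insertion: letter drawn from background, so its ratio is 1
                ins = profile[i - 1]["insert"] if i > 0 else start_insert
                total += F[i][j - 1] * ins
            F[i][j] = total
    return F[m][n]


def column(favoured, match, insert, delete):
    """Build one hypothetical DNA profile column favouring a single letter."""
    emit = {c: 0.1 for c in "ACGT"}
    emit[favoured] = 0.7
    return {"emit": emit, "match": match, "insert": insert, "delete": delete}


bg = {c: 0.25 for c in "ACGT"}
ungapped = [column("A", 1.0, 0.0, 0.0), column("C", 1.0, 0.0, 0.0)]
gapped = [column("A", 0.8, 0.1, 0.1), column("C", 0.8, 0.1, 0.1)]
```

With the toy `gapped` profile, `forward_ratio(gapped, bg, "AC")` exceeds `forward_ratio(gapped, bg, "GT")`, and a sequence with an extra letter such as `"AGC"` still receives a nonzero ratio, whereas the `ungapped` profile gives it zero. Giving each column its own weights is what lets insertions and deletions be cheap at some positions and costly at others.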

Keywords

alignment, homology, LF-SINE, probability
