Abstract
One way to understand biology is to find genetic sequences that are related to each other. A family of related sequences often has position-varying probabilities of substitutions, insertions, and deletions, and these can be used to find distantly related sequences. Popular software tools do this, but all have limitations: they either do not use all of the probabilistic evidence (e.g., PSI-BLAST, MMseqs2) or have excessive complexity and minor biases (e.g., HMMER). This complexity inhibits fertile development of alternative tools.
This study describes a simplest reasonable way to find related sequences, making full use of position-varying probabilities. The algorithms likely use the fewest operations that such algorithms possibly could, so they are fast and simple. The approach has been implemented in prototype software named DUMMER (Dumb Uncomplicated Match ModelER), whose sensitivity and specificity are competitive with those of HMMER. DUMMER finds evidence that the human genome has many more relics of some ancient transposons, including LF-SINE, which was co-opted for various functions in common ancestors of all land vertebrates.
