Abstract
ABSTRACT
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30–40 base pairs) approximate tandem repeats where copies may differ up to ϵ = 15–20% from a consensus model of the repeating unit (implying individual units may vary by 2ϵ from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 104 when ϵ = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as the span of the satellite. The first phase was designed for efficiency and takes only 0(n) time where n is the length of the sequence. The second phase was designed for sensitivity and takes time O(n · [unk](e, k)) in the worst case where k is the length of the repeating unit m,e = ⌊ϵk⌊ is the number of differences allowed between each repeat unit and the model m, and [unk] (e,k) is the maximum number of words that are not more than e differences from another word of length k. That is,[unk](e, k) is the maximum size of an e-neighborhood of a string of length k. Experiments reveal the second phase to be considerably faster in practice than the worst-case complexity bound suggests. Finally, the present algorithm is easily adapted to finding tandem repeats in protein sequences, as well as extended to identifying mixed direct-inverse tandem repeats.
Get full access to this article
View all access options for this article.
