Sage Journals: Discover world-class research

Abstract

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D₂ statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D₂ statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D₂ word count statistic, which we call D₂^S and D₂^*. For D₂^S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed as the sequence lengths tend to infinity and is not dominated by the noise in the individual sequences. The second statistic, D₂^*, outperforms D₂^S in terms of power for detecting the relatedness between the two sequences in our examples; however, although it is straightforward to simulate from the asymptotic distribution of D₂^*, we cannot provide a closed form for power calculations.
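To make the word count idea concrete, the following is a minimal Python sketch and not the authors' implementation. The function `d2` computes the classical D₂ statistic as the sum, over all k-tuples, of the product of their counts in the two sequences. The abstract does not state the formulas for D₂^S and D₂^*, so the forms in `centred_variants` are assumptions based on how these statistics are commonly written: counts are centred by their expected values under an i.i.d. letter model, then divided either by a self-standardizing term (D₂^S) or by a term proportional to the word probability (D₂^*). The function names and the `letter_probs` parameter are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count overlapping k-tuples (words of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k):
    """Classical D2: sum over words w of X_w * Y_w."""
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    # Only words present in both sequences contribute a nonzero product.
    return sum(x[w] * y[w] for w in x.keys() & y.keys())


def centred_variants(a, b, k, letter_probs):
    """Return (D2^S, D2^*) under an assumed i.i.d. letter model.

    Assumed forms (not given in the abstract):
      centred counts  Xc_w = X_w - n_bar * p_w,  Yc_w = Y_w - m_bar * p_w
      D2^S = sum_w Xc_w * Yc_w / sqrt(Xc_w**2 + Yc_w**2)
      D2^* = sum_w Xc_w * Yc_w / (sqrt(n_bar * m_bar) * p_w)
    Both sequences are assumed to have length >= k.
    """
    x, y = kmer_counts(a, k), kmer_counts(b, k)
    n_bar, m_bar = len(a) - k + 1, len(b) - k + 1
    d2_s = d2_star = 0.0
    # Every word over the alphabet matters, since absent words
    # still have nonzero centred counts.
    for letters in product(sorted(letter_probs), repeat=k):
        w = "".join(letters)
        p_w = prod(letter_probs[c] for c in letters)
        xc = x[w] - n_bar * p_w
        yc = y[w] - m_bar * p_w
        d2_star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
        norm = sqrt(xc * xc + yc * yc)
        if norm > 0:  # skip words whose centred counts are both zero
            d2_s += xc * yc / norm
    return d2_s, d2_star
```

For example, `d2("ACGT", "ACGT", 2)` returns 3, because each of the words AC, CG and GT occurs once in both sequences. A call such as `centred_variants(a, b, 2, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})` assumes uniform letter probabilities; in practice these would have to be estimated or specified by the null model.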

Alignment-Free Sequence Comparison (I): Statistics and Power
