
Abstract

Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, $D_2$, $D_2^*$, and $D_2^S$, both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both $D_2^*$ and $D_2^S$ outperform $D_2$ for detecting the relationship between two sequences based on NGS data. We then study the effects of length of the tuple, read length, coverage, and sequencing error on the power of $D_2^*$ and $D_2^S$. Finally, variations of these statistics, $d_2$, $d_2^*$, and $d_2^S$, respectively, are used to first cluster five mammalian species with known phylogenetic relationships, and then cluster 13 tree species whose complete genome sequences are not available using NGS shotgun reads. The clustering results using $d_2^S$ are consistent with biological knowledge for the 5 mammalian and 13 tree species, respectively. Thus, the statistic $d_2^S$ provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
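
The abstract names the three statistics without defining them. As a rough illustration only, the sketch below computes $D_2$, $D_2^*$, and $D_2^S$ from the k-tuple counts of two read sets, using the definitions commonly given in the alignment-free literature rather than anything stated in this abstract. Several simplifications are assumptions of the sketch: tuples are counted on the forward strand only, the background model is i.i.d. with user-supplied base probabilities (uniform by default), and the function names `kmer_counts` and `d2_family` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

ALPHABET = "ACGT"

def kmer_counts(reads, k):
    """Count k-tuples over all reads (forward strand only; an assumption)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip tuples containing ambiguous bases such as N.
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts

def d2_family(reads_x, reads_y, k, base_probs=None):
    """Return (D2, D2*, D2S) for two read sets under an i.i.d. background.

    base_probs maps each base to its assumed background probability;
    a uniform background is used when it is omitted.
    """
    if base_probs is None:
        base_probs = {b: 0.25 for b in ALPHABET}
    x = kmer_counts(reads_x, k)
    y = kmer_counts(reads_y, k)
    n_x = sum(x.values())  # total number of k-tuple positions in sample X
    n_y = sum(y.values())  # total number of k-tuple positions in sample Y
    if n_x == 0 or n_y == 0:
        raise ValueError("each read set must contain at least one k-tuple")

    d2 = d2_star = d2_s = 0.0
    for letters in product(ALPHABET, repeat=k):
        w = "".join(letters)
        p_w = prod(base_probs[b] for b in w)  # background probability of w
        xc = x[w] - n_x * p_w                 # centered count in X
        yc = y[w] - n_y * p_w                 # centered count in Y
        d2 += x[w] * y[w]                     # D2: raw count product
        d2_star += xc * yc / (sqrt(n_x * n_y) * p_w)  # D2*: variance-scaled
        denom = sqrt(xc * xc + yc * yc)
        if denom > 0:
            d2_s += xc * yc / denom           # D2S: self-standardized
    return d2, d2_star, d2_s
```

Calling `d2_family(reads_a, reads_b, k=5)` on two lists of read strings returns the three values as a tuple. Enumerating all $4^k$ tuples keeps the code short but is only practical for small `k`; a real analysis would also need a background model estimated from the data and a treatment of reverse-complement strands.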




Alignment-Free Sequence Comparison Based on Next-Generation Sequencing Reads
