Abstract

Genomic variations are a focus of research aimed at uncovering the mechanisms of host–pathogen interactions and of diseases such as cancer. Today, next-generation sequencing (NGS) data are analyzed through dedicated pipelines to detect these variations. Surrogate NGS data in conjunction with genomic variations help to evaluate pipelines and validate their outcomes, which supports the selection of suitable tools for a given scientific question. I describe how existing approaches for simulating NGS data in conjunction with genomic variations fail to model local enrichments of single nucleotide polymorphisms (SNPs), so-called SNP clusters. Two distributions for count data are applied to publicly available collections of genomic variations. The results suggest that SNP cluster sizes should be modeled with overdispersion-aware distributions.
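To illustrate what "overdispersion-aware" means for count data, the following minimal sketch compares a Poisson fit with a negative binomial fit on a toy set of SNP counts. The abstract does not name the two distributions used in the article, so the choice of Poisson and negative binomial here is an assumption. The data, the window-based counting, the method-of-moments fit, and all function and variable names are likewise hypothetical and not taken from ShRangeSim.

```python
import math
from statistics import mean, variance

# Toy SNP counts per fixed-size window (made-up numbers, not from the article).
snp_counts = [0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 9, 0, 0, 1, 0, 0, 12, 0, 3, 0]


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under Poisson(lam)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, r, p):
    """Log-likelihood under a negative binomial with size r and probability p.

    pmf(k) = Gamma(k + r) / (k! * Gamma(r)) * p**r * (1 - p)**k
    """
    return sum(
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1 - p)
        for k in counts
    )


def fit_negbin_moments(counts):
    """Method-of-moments estimates (r, p); requires variance > mean."""
    m, v = mean(counts), variance(counts)
    if v <= m:
        raise ValueError("counts are not overdispersed; use a Poisson model")
    # From mean = r(1-p)/p and variance = r(1-p)/p**2.
    return m * m / (v - m), m / v


m = mean(snp_counts)
v = variance(snp_counts)
dispersion_index = v / m  # equals 1 in expectation for Poisson data

r_hat, p_hat = fit_negbin_moments(snp_counts)

# AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better.
# The Poisson rate uses its maximum-likelihood estimate (the sample mean);
# the negative binomial uses moment estimates, so its AIC is approximate.
aic_poisson = 2 * 1 - 2 * poisson_loglik(snp_counts, m)
aic_negbin = 2 * 2 - 2 * negbin_loglik(snp_counts, r_hat, p_hat)

print(f"dispersion index: {dispersion_index:.2f}")
print(f"AIC Poisson: {aic_poisson:.1f}, AIC negative binomial: {aic_negbin:.1f}")
```

A dispersion index well above 1 means the variance exceeds the mean, which a Poisson model cannot represent because it has a single rate parameter. In this toy example the negative binomial reaches a lower AIC despite its extra parameter, which is the kind of evidence that would favor an overdispersion-aware model.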




ShRangeSim: Simulation of Single Nucleotide Polymorphism Clusters in Next-Generation Sequencing Data
