
Abstract

A reference genome is a high-quality individual genome that is used as a coordinate system for the genomes of a population, or for the genomes of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks, we formalize the problem of ordering and orienting the blocks such that the resulting ordering maximally agrees with the ordering and orientation of the underlying genomes, creating a pan-genome reference ordering. We show that this problem is NP-hard, but also demonstrate, empirically and in simulations, the performance of heuristic algorithms based on a cactus graph decomposition that find locally maximal solutions. We describe an extension of our Cactus software to create a pan-genome reference for whole-genome alignments, and demonstrate how it can be used to create novel genome browser visualizations, using human variation data as a test. In addition, we test the use of a pan-genome for describing variations and as a reference for read mapping.
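To make the ordering-and-orienting objective concrete, the following toy sketch may help. It is not the formalization or the Cactus-based heuristic described in the article. It assumes that each genome is a list of signed integers (one per alignment block, with the sign giving orientation), that no block is duplicated within a genome, and that agreement is a simple count of block pairs whose relative order and orientation match the candidate reference. The function names `agreement` and `best_reference` are hypothetical. The exhaustive search is only feasible for a handful of blocks, which is in keeping with the hardness result stated above.

```python
from itertools import combinations, permutations, product


def agreement(reference, genomes):
    """Count block pairs in the genomes that agree with a candidate reference.

    Assumed toy objective: a pair (a before b in a genome) agrees if both
    blocks have the same orientation relative to the reference and their
    order matches the reference read in that direction.
    """
    pos = {abs(b): i for i, b in enumerate(reference)}
    ref_sign = {abs(b): (1 if b > 0 else -1) for b in reference}
    total = 0
    for genome in genomes:
        # Keep only blocks that the candidate reference contains.
        blocks = [b for b in genome if abs(b) in pos]
        for a, b in combinations(blocks, 2):  # a precedes b in the genome
            rel_a = (1 if a > 0 else -1) * ref_sign[abs(a)]
            rel_b = (1 if b > 0 else -1) * ref_sign[abs(b)]
            if rel_a != rel_b:
                continue  # orientations disagree relative to the reference
            forward = pos[abs(a)] < pos[abs(b)]
            if (rel_a == 1 and forward) or (rel_a == -1 and not forward):
                total += 1
    return total


def best_reference(genomes):
    """Exhaustively search all signed orderings (tiny inputs only)."""
    blocks = sorted({abs(b) for g in genomes for b in g})
    candidates = (
        [s * b for s, b in zip(signs, perm)]
        for perm in permutations(blocks)
        for signs in product((1, -1), repeat=len(blocks))
    )
    return max(candidates, key=lambda ref: agreement(ref, genomes))


if __name__ == "__main__":
    # Three toy genomes; the second carries an inversion of block 2.
    genomes = [[1, 2, 3], [1, -2, 3], [1, 2, 3]]
    print(agreement([1, 2, 3], genomes))   # 7
    print(agreement([1, -2, 3], genomes))  # 5
    print(agreement(best_reference(genomes), genomes))  # 7
```

In this toy setting a reference and its reverse complement score identically, so the best ordering is never unique; the search returns one of the tied candidates.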

Building a Pan-Genome Reference for a Population
