Abstract

Reconstruction of population histories is a central problem in population genetics. Existing coalescent-based methods, such as the seminal work of Li and Durbin, attempt to solve this problem using sequence data but have no rigorous guarantees. Determining the amount of data needed to correctly reconstruct population histories is a major challenge. Using a variety of tools from information theory, the theory of extremal polynomials, and approximation theory, we prove new sharp information-theoretic lower bounds on the problem of reconstructing population structure—the history of multiple subpopulations that merge, split, and change sizes over time. Our lower bounds are exponential in the number of subpopulations, even when reconstructing recent histories. We demonstrate the sharpness of our lower bounds by providing algorithms for distinguishing and learning population histories with matching dependence on the number of subpopulations. Along the way and of independent interest, we essentially determine the optimal number of samples needed to learn an exponential mixture distribution information-theoretically, proving the upper bound by analyzing natural (and efficient) algorithms for this problem.

References

1. Bhaskar, and Song, Y.S. 2014. Descartes' rule of signs and the identifiability of population demographic models from genomic variation data. Ann. Statist. 42, 2469.

2. Bhaskar, Wang, Y.R., and Song, Y.S. 2015. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res. 25, 268–279.

3. Blythe, R.A., and McKane, A.J. 2007. Stochastic models of evolution in genetics, ecology and linguistics. J. Stat. Mech. Theory Exp. 2007, P07018.

4. Candès, E.J., and Fernandez-Granda 2013. Super-resolution from noisy data. J. Fourier Anal. Appl. 19, 1229–1254.

5. Drummond, Rambaut, Shapiro, et al. 2005. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 22, 1185–1192.

6. Excoffier, Dupanloup, Huerta-Sánchez, et al. 2013. Robust demographic inference from genomic and SNP data. PLoS Genet. 9, e1003905.

7. Feldmann, and Whitt 1998. Fitting mixtures of exponentials to long-tail distributions to analyze network performance models. Perf. Eval. 31, 245–279.

8. Gautschi 1962. On inverses of Vandermonde and confluent Vandermonde matrices. Numer. Math. 4, 117–123.

9. Heled, and Drummond 2008. Bayesian inference of population size history from multiple loci. BMC Evol. Biol. 8, 289.

10. Hua, and Sarkar, T.K. 1990. Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise. IEEE Trans. Acous. Speech Signal Proc. 38, 814–824.

11. Joseph, T.A., and Pe'er 2018. Inference of population structure from ancient DNA, 90–104. In RECOMB. Springer, Cham.

12. Kim, Mossel, Rácz, M.Z., et al. 2015. Can one hear the shape of a population history? Theor. Popul. Biol. 100, 26–38.

13. Li, and Durbin 2011. Inference of human population history from individual whole-genome sequences. Nature, 475, 493.

14. McVean, G.A., and Cardin, N.J. 2005. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 1387–1393.

15. Moitra 2015. Super-resolution, extremal functions and the condition number of Vandermonde matrices, 821–830. Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC 2015, ACM, New York, NY.

16. Myers, Fefferman, and Patterson 2008. Can one learn history from the allelic spectrum? Theor. Popul. Biol. 73, 342–348.

17. Nazarov, F.L. 1993. Local estimates for exponential polynomials and their applications to inequalities of the uncertainty principle type. Algebra i Analiz, 5, 3–66.

18. Nielsen 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154, 931–942.

19. Nordborg 2001. Coalescent theory. Handb. Stat. Genet. 2, 843–877.

20. Prony 1795. Essai expérimental et analytique: sur les lois de la dilatabilité de fluides élastique et sur celles de la force expansive de la vapeur de l'alkool, à différentes températures. J. Ec. Polytech. Math. 1, 24–76.

21. Schiffels, and Durbin 2014. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919.

22. Sheehan, Harris, and Song, Y.S. 2013. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics, 194, 647–662.

23. Terhorst, Kamm, J.A., and Song, Y.S. 2017. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49, 303.

24. Terhorst, and Song, Y.S. 2015. Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proc. Natl. Acad. Sci. U. S. A. 112, 7677–7682.

25. Turán 1984. On a New Method of Analysis and Its Applications. Wiley, New York, NY.

How Many Subpopulations Is Too Many? Exponential Lower Bounds for Inferring Population Histories
