Abstract
Structurama is a program for inferring population structure. Specifically, the program calculates the posterior probability of assigning individuals to different populations. The program takes as input a file containing the allelic information at some number of loci sampled from a collection of individuals. After reading a data file into computer memory, Structurama uses a Gibbs algorithm to sample assignments of individuals to populations. The program implements four different models: The number of populations can be considered fixed or a random variable with a Dirichlet process prior; moreover, the genotypes of the individuals in the analysis can be considered to come from a single population (no admixture) or as coming from several different populations (admixture). The output is a file of partitions of individuals to populations that were sampled by the Markov chain Monte Carlo algorithm. The partitions are sampled in proportion to their posterior probabilities. The program implements a number of ways to summarize the sampled partitions, including calculation of the ‘mean’ partition—a partition of the individuals to populations that minimizes the squared distance to the sampled partitions.
Introduction
Natural populations of organisms often exhibit some degree of population subdivision. Identifying this population structure is important for several reasons. Practically speaking, undetected population structure can adversely affect statistical tests for the presence of natural selection 1 or of genetic association. 2 Population structure is also known to affect the evolutionary dynamics of alleles in populations; understanding patterns of population subdivision, then, is often a first step in learning about the evolutionary forces affecting a species.
Identifying population structure is a difficult problem that has motivated a variety of statistical and computational approaches. Here, we focus only on Bayesian methods for inferring population structure.3–8 Pritchard et al
8
proposed a widely-used Bayesian method for inferring population structure. The simplest variant of the Pritchard et al
8
method assumes a fixed number of populations,
Like many of the other methods that have been proposed, the Bayesian one proposed by Pritchard et al 8 assumes a fixed number of populations. Determining the correct number of populations for a particular set of observations is itself a difficult problem. With some reluctance, Pritchard et al 8 suggested determining the number of populations by approximating the marginal likelihoods of the data when the number of populations varies. In short, one performs repeated analyses with different numbers of populations; the number of populations that results in the maximum marginal likelihood for the data is chosen as the optimal value for the analysis. In a simulation study, Evanno et al 9 found that the method based upon marginal likelihoods performs poorly. (The poor performance may be related to the instability of the harmonic mean estimator of the marginal likelihood; see. 10 )
More recently, Pella and Masuda 7 proposed a Bayesian method for determining population structure in which the number of populations is a random variable. Structurama implements the methods of Pritchard et al 8 and Pella and Masuda; 7 also see. 11 The program also implements a hierarchical variant of the Dirichlet Process prior model. 12 We use this model to account for admixture when the number of populations is considered a random variable. Structurama also implements a novel method for summarizing the results of a Bayesian analysis of population structure using the mean partition.
Approach
Pella and Masuda
7
assume that the assignment of individuals to populations and the number of populations follow a Dirichlet process prior.13,14 Like Pritchard et al
8
Pella and Masuda
7
assume Hardy-Weinberg equilibirum of allele frequencies within a population, linkage equilibrium of the loci, and a Dirichlet prior probability distribution for the allele frequencies within a population. Their application of a Dirichlet process prior to the problem, however, is original. The Dirichlet process prior has been described extensively elsewhere,13–15 and effective Markov chain Monte Carlo methods for sampling under this model have been described by Neal.
15
Here, we will provide an intuitive explanation of the Dirichlet process prior, which is sometimes referred to as the ‘Chinese Restaurant Table Process’.16,17 One imagines a (presumably very large) Chinese restaurant with a countably infinite number of tables. Patrons enter the restaurant one at a time (there are a total of
Under the Dirichlet process prior, one can calculate the probability of a particular configuration of patrons at tables, and importantly, this probability does not depend upon the order in which the patrons enter the restaurant. The joint probability of the assignment of individuals to tables and the number tables is
The parameter αdetermines the tendency of patrons to sit at the same table. If α is small, then patrons are more likely to sit at the same table. In fact, the probability that patron
Finally, the probability that a total of
where nak is the absolute value of the Stirling number of the first kind.
In the context of determining population structure, populations are equivalent to the ‘tables’ of the Chinese restaurant example. Moreover, all of the individuals in a particular population share a common set of allele frequencies. These allele frequencies are drawn from a flat Dirichlet prior probability distribution. (It is unfortunate that ‘Dirichlet’ is used to name two very different probability distributions: the Dirichlet probability distribution on allele frequencies and the Dirichlet process prior describing how individuals are grouped into populations.)
We also implemented a hierarchical version of the Dirichlet process prior model
12
that allows us to accommodate admixture while treating the number of populations as a random variable. The hierarchical Dirichlet process prior has not been applied to the problem of assigning individuals to populations. Under the hierarchical Dirichlet process prior, there are
We consider the assignment of individuals to populations to be a partition, where a partition is a division of a set into nonempty and disjoint sets which completely cover the set. Structurama implements Algorithm 3 of Neal
15
to sample partitions using a Gibbs sampling method when there is no admixture. We use a similar algorithm, described by Teh et al
12
for performing MCMC under the hierarchical Dirichlet procoess model (for a model with admixture). Partitions of individuals among populations are sampled in proportion to their posterior probabilities. The end result is a file with sampled partitions. Part of a sample of
where partitions are labeled according to the restricted growth function notation of Stanton and White. 18 The first sample taken from the Markov chain has three populations, with individuals 1, 2, 4, and 7 grouped together into one population, individuals 3, 8, 9, and 10 grouped together into a second population, and individuals 5 and 6 grouped together into a third population.
Although the meaning of any single partition is unambiguous, it can be difficult to describe features in common among a set of partitions. How can one summarize features in common for a collection of partitions? One approach is simply to assign each population an index in computer memory. Instead of reporting the restricted growth function notation for a partition, one simply reports the index for each individual. The problem with this approach is that if the MCMC works properly, the labels should switch. That is, the MCMC algorithm should visit the following two partitions equally often (they imply equivalent groupings of individuals into populations): (1,1,1,1,1,2,2,2,2,2) and (2,2,2,2,2,1,1,1,1,1). When the number of populations is fixed, this problem is more theoretical than practical because MCMC fails to visit equivalent labelings of the partitions. However, when the number of populations is a random variable, it is no longer suficient to use an arbitrary index for populations; the meaning of an index can change over the course of the Markov chain. Structurama summarizes the results on partitions using a number of methods. Perhaps the most notable is the use of the mean partition. We define the mean partition as the partition of individuals to populations which minimizes the sum of the squared distances to the sampled partitions. We use Gusfield's
19
distance on partitions. The partition distance is the minimum number of individuals that need to be removed from two partitions to make the induced partitions identical. Structurama also calculates the posterior probability of grouping each of the
Program Details
Structurama takes as input a text file containing the allelic information for the sampled loci for each individual. The file format is a structured one, in the style of the Nexus format used by many phylogeny programs. 20 The following illustrates the file format for a study of impala.21,22
Note that this input file has
After the data has been read into computer memory (using the execute command), the user specifies the details of the model using the model command. Here the user has four choices:
model numpops=<number> admixture=no
model numpops=<number> admixture=yes
model numpops=rv admixture=no
model numpops=rv admixture=yes
The first two commands specify the models described by Pritchard et al. 8 The third model specifies the Dirichlet process prior model without admixture described by Pella and Masuda 7 and Huelsenbeck and Andolfatto. 11 The final model is unique to Structurama and specifies the hierarchical Dirichlet process prior model. 12 The user also can specify a hierarchical prior for the parameters of the Dirichlet process model.
Once the user has specified the model, the Markov chain Monte Carlo analysis is performed using the mcmc command. The MCMC algorithm samples partitions in proportion to their posterior probabilities. The sampled partitions are saved to a file in the restricted growth function notation for partitions. 18 The user can summarize the results of the MCMC analysis using one of several commands. The posterior probabilities of individuals being assigned to the same population are obtained using the showtogetherness command. The posterior probability distribution for the number of populations is obtained using the shownumpops command. Finally, the mean partition is obtained using the showmeanpart command.
Program Availability
Structurama is available for download from www.structurama.org.
Disclosure
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
Footnotes
Acknowledgements
The development of this program was supported by NSF and NIH grants DEB-0445453 and GM-069801, respectively, awarded to J.P.H. The authors would also like to thank Youn-Ho Lee and the Kordi South Sea Institute for their support during the development of this program.
