Abstract
Using the structured serial coalescent with Bayesian MCMC and serial samples, we estimate population size when some demes are not sampled or are hidden, ie ghost demes. It is found that even with the presence of a ghost deme, accurate inference was possible if the parameters are estimated with the true model. However with an incorrect model, estimates were biased and can be positively misleading. We extend these results to the case where there are sequences from the ghost at the last time sample. This case can arise in HIV patients, when some tissue samples and viral sequences only become available after death. When some sequences from the ghost deme are available at the last sampling time, estimation bias is reduced and accurate estimation of parameters associated with the ghost deme is possible despite sampling bias. Migration rates for this case are also shown to be good estimates when migration values are low.
Keywords
Introduction
Frequently when we estimate the parameters associated with an island population model with migration using the structured Kingman coalescent (Kingman, 1982a, b), it is assumed that we sample all the demes in the population. In practice, sampling every deme may not be possible; frequently the data is just not available. It may be the case that we do not even know how many demes there are for the population of interest. Previous work on hidden (or ghost) demes was done by Beerli (2004). In that study he considered 3 subpopulations, 2 with sampled data and one that was not sampled. The study included a range of migration rates to and from the ghost deme. Estimates were then obtained with simulated data using a Maximum Likelihood approach with the program MIGRATE (Beerli and Felsenstein, 2001, 1999), both with and without the ghost deme. Beerli's results are summarised in Table 1. It was found that estimated population sizes of observed demes tend to be biased when the estimation model is different from the true model.
Summary of the results from this study and from Beerli (2004). We have excluded migration results from Beerli (2004). Models used for data generation and estimation are shown horizontally and vertically respectively. When λ = 0 there is no migration from or to the ghost deme and it is effectively excluded. Results show that the two studies agree qualitatively and illustrate a general trend that estimation with the incorrect demographic model leads to poor performance of the estimators.
In this paper we extend the work in Beerli (2004) to Bayesian estimation of coalescent-based population parameters using sequence samples obtained serially over time. In particular, we look at the ability of the Bayesian method presented in Ewing et al. (2004) to resolve population sizes when there are hidden or ghost demes with serial samples. We further extend this to the case where there is a single sample from the ghost deme at the last time point. This case can arise in HIV patients, for example when some tissue samples and viral sequences only become available after death (Wong et al. 1997; Nickle et al. 2003). The baseline model we use is a 2-deme model with samples coming from only one deme at 3 different times and possibly from the second deme at the last time point. In Beerli (2004) migration rates were also estimated. However, Beerli's study used maximum likelihood and was for isochronous (i.e., non-serial sample) data only.
By focusing on population size, we can restrict our models to just 2 demes in order to reduce the times needed for inference and simplify analysis. We look at the problem of estimation with and without the ghost deme and consider the related problem of inference under the assumption of a ghost deme when the true model has none.
The rest of this paper is set out as follows. In the next section we discuss the details of the simulation datasets and the procedures used to estimate parameters over these datasets. In Section 3 we outline the main results and show appropriate summary statistics. Comparison to the general findings of Beerli (2004) are mentioned and we present results under the ghost deme model while the truth contains no such deme. We also consider how often the truth would be rejected from a 95% Highest Posterior Density interval (HPD). Finally we discuss these results with regard to practical approaches to applying this type of inference to real problems.
Methods
We now discuss the specific parameters used for simulation and estimation. Data were simulated under the coalescent with the appropriate model, then parameters inferred with the same code that was presented in Ewing et al. (2004); Ewing and Rodrigo (2006). Table 1 shows the different data generation and estimation models considered. We produce a genealogy from the coalescent prior based on the given true parameters and then generate simulated sequences. The length of the simulated sequences was relatively short (250 base pairs) to allow for faster likelihood evaluations and to better replicate the uncertainty of branch lengths and topology that we see in real data of greater length. The simulated data consists of 3 time points equally spaced with 10 sequences per time point per deme. The distance between time points was 7.5 × 10–2 expected number of substitutions per unit time with a normalised mutation rate of 1. This produced small trees (30 sequences) but the population size can still be accurately inferred from such a dataset as shown in Ewing et al. (2004) and Drummond et al. (2002). Following the notation in Ewing and Rodrigo (2006), population size was parameterised as θ
For each set of simulated data parameters, 51 datasets were generated and the parameters inferred with the Bayesian MCMC (Ewing et al. 2004; Ewing and Rodrigo, 2006). Estimation was done with constrained symmetric migration rates. That is Schematic representation of the model. The dashed circle represents the hidden and unsampled ghost deme.
Finally we generated 51 sets of zero-migration coalescent genealogies (i.e., models where no ghost deme is present) as in Beerli (2004). For this model θ1 = 0.1; We estimated the true parameters assuming the absence of a ghost deme, and under the assumption that the ghost deme is present.
Mode estimators from the posterior are used as estimators due to the long tails that are common for marginal posterior density plots (Figure 2) and provide both lower bias and lower sensitivity to priors. The mode is estimated by noting the posterior density is inversely proportional to the distance between a fixed number of ordered values. We use 4% of the samples for the ordered values and take a median for the estimate. We have found that this provides very good estimates and is accurate when compared with Gaussian kernel posterior density estimates (Ewing et al. 2004). We only include estimates for sampled demes due to the high level of uncertainty for un-sampled demes, i.e., we do not report Marginal posterior density for 
95% Highest Posterior Density intervals were estimated in a similar fashion. We simply look for the shortest distance between ordered samples that encloses 95% of the samples. This method fails if the posterior is multi-modal, and in all cases marginal posterior densities were checked visually to ensure that they are unimodal.
In this study, we counted the proportion of 95% HPD's that enclosed the true value of θ1, as well as the the value of θ1 + θ2, where appropriate. The latter value of θ1 + θ2 represents the true effective size under an equivalent panmictic model (Nagylaki, 1980). Hence under high migration rates, it is expected the estimated θ1 for the sampled population will approach θ1 + θ2.
Results
The results showing the means of the modes for the different cases and migration rates. A clear trend is observed when the hidden deme is not taken into account (case I), and we see the truth is rejected more than expected even for low migration rates. As migration tends to large values the estimate of
Case I(a): Estimation without a Ghost Deme when λ = 0
First we consider the base case when there is no ghost deme and estimation is performed without a ghost deme. Due to the smaller number of parameters we expect this to be the case of best performance. In this case the true value of
Case I(b): Estimation with a Ghost Deme with λ = 0
For this section we consider the performance when the data is generated with λ = 0 or without a ghost deme, but we assume that a ghost deme exists for estimation. The estimation of Traces of θ1 and θ2. The clear jumps correspond directly to genealogy migration patterns. Most coalescent events are in one or the other deme, corresponding directly with either poorly determined population size (~ 0 coalescent events) or a well defined population size.
In Figure 4 the peak close to λ = 0 indicates that there is substantial probability around this true value. But this peak is small in terms of the total probability that it represents and the marginal density indicated that this is not distinguishable from high migration. We now need to consider if the lack of support for low migration, when the truth is zero, is due to mixing. The chain can only slowly remove migration events and in turn lower migration rates, thereby only mixing down to the lower migration values rarely. Even the recolour move described in the appendix in Ewing and Rodrigo (2006), does not appear to help the mixing. A better move (MCMC kernel proposal) would be to change the migration rate and regenerate the migration nodes as a single move. We have not implemented this move for this study. We did inspect the 51 marginal posterior densities and we noted a small number that rejected low migration. When we extend these runs
1
(× 2–5 longer), it was found that in most cases the peak at λ = 0 would manifest itself, and in a majority of cases this peak was the mode. This is a clear indication that the migration in particular does not mix properly despite the high ESS. This can be understood from the small chance of getting close to small values of λ with a random walk over permissable values and indicates that there could be some sensitivity to the prior bound. Some exploratory chains were run with larger priors with the result of a reduction in the peak at zero and reduced mixing.
Marginal Posterior Density for migration. There is a clear peak at zero. The rest of the posterior is noisy due to the poor mixing over this large range and the ESS for this parameter was the lowest for this run at 150. The prior is a flat bounded prior from 0 to 250.
The code supports extending any MCMC chain from where it stops. It records the machine state in a file every twenty minutes and then at the end of a run. It is then easy to “restart” a run from where it stopped and make it longer.
Furthermore we attempted to explore the use of a flat prior in log space or a Jeffreys prior on all parameters with the effect of hugely increasing the mixing of the
This scenario illustrates the potential bias a flat bounded prior has on Bayesian inference. As we increase the upper bound on λ we increase prior support for a non-zero migration rate. With the current prior, a non zero migration rate is never rejected from a 95% confidence interval. We will discuss how we might overcome these limitations in section 4. Prior sensitivity analysis was not carried out except as mentioned above and formal hypothesis testing would need more thorough treatment.
Case II(a): Estimation without a Ghost Deme with λ > 0
When the migration rate is not zero and we estimate without a ghost deme the acceptance of the true value for
Case II(b): Estimation with a Ghost Deme with λ > 0
If we include the unobserved deme in the estimation process, this bias is strongly suppressed. We see from Table 2 that when we estimate with a ghost deme mode estimators of θ1 are not far off from the truth. We accept the true value as expected and we reject the false values at about the same rate as the base case or case I(a) (the first column of Table 2 of case I). We have omitted the summary statistics for parameters of the ghost deme itself due to the fact that the marginal posterior densities for θ1 varie only slightly from the prior. That is, it is flat and extends out to the bounds. Marginal migration posterior densities are also generally uninformative. With low migration rates (5 or less) there was support for a region smaller than the prior bounds, but it was still diffuse. At high migration rates the marginal posterior density did exclude low migration rates, but was otherwise flat out to the bounds.
Case III: Estimation with a Few Sequences from the Ghost Deme
When we have just a few sequences (10) from the ghost deme at the most recent sample time only, the power of the method increases significantly. From Table 2 we see that even this small amount of data has reduced the bias seen under estimation with a ghost deme. With this data we also have informative marginal posterior densities for population size for both demes as shown in Figure 5.
Marginal posterior density for θ1 and θ2 when there are a few sequences collected from the ghost deme. Both posterior densities are well resolved.
There is a transition between low and high migration rates with regard to estimation of migration rates however. At λ = 20 the marginal migration posteriors often look like Figure 6(b) and we see they simply support any migration rate that is large. In this regime we note that population size estimates for the ghost deme become less informative and a long tail becomes the common distribution. However, the mode is still well defined for θ1 and we are able to make informed estimates of population sizes.
Marginal posterior density for 
At low migration rates λ = 5, migration is somewhat resolved with a clearmode and overestimates the true value as previously reported and shown in Figure 6(b). With these reasonably well defined migration posteriors, particularly at low migration rates, there is larger variance for the estimate of population size of the ghost deme due to a smaller number of sequences and lack of data at other sample times.
From visual inspection of the traces it was observed that there were significant mixing issues for some datasets with large migration (λ ≥ 20). The nature of the slow mixing was identical to that already discussed in Section 3.0.2. ESS for some runs were therefore quite small (200–500) and long run times were often required. This further illustrates the point that a better MCMC proposal kernel is needed to improve the mixing and allow larger datasets to be practical.
Discussion
In this paper we have looked at the ability to infer population size and migration rate using a coalescent-based Bayesian MCMC approach when we have serial samples from just one deme, while the true underlying population structure includes a hidden or ghost deme with migration between them. The case when we have samples from both demes only at the last time point was also presented. We also consider the related problem of estimating population size assuming a ghost deme when the true model does not contain this unobserved deme. It was found that in the former case, where there is a hidden deme when it is not taken into account, it overestimates the population size. As migration rates tend to large values, the estimate of population size tends to the sum of the subpopulations as we expect. For the opposite case, where the hidden deme is incorrectly included,
In all cases, estimating under the correct model gave informative inference and reduced biases as compared to estimation under the incorrect model. The inclusion of a ghost deme when there is such a ghost deme effectively removes the dependence of estimated population size with migration rate. However, there is still a slight downward bias in the mean estimators, and migration estimates were very diffuse. The true value for θ1 was within the 95% HPD interval as expected and we could reject
Adding a small number of sequences at the last time point for the ghost deme made a substantial difference. An instance where this type of data may arise is with HIV populations within hosts. Often we can only sample some tissues at autopsy and we may have no sequence data from these tissues at earlier sample times. The bias that was observed was effectively removed for low migration rates and reduced for large migration rates. At low migration rates estimation can be acceptable, while at high migration all we may be able to do is rule out low migration rates.
We have established that in general for the structured coalescent, we need to be accurate with the population structure to avoid misleading results. But we can't always be sure of what the “correct” model is. However, there is no available test that compares different island-population models. Surprisingly a direct Bayesian model averaging approach or even any coalescent based approach to this question has not been carried out anywhere in the field. We suggest Bayesian model averaging as a way of addressing this problem.
Bayesian Model Averaging (BMA) in a MCMC setting is implemented by adding a move that jumps between models with the appropriate acceptance probability (Hoeting et al. 1999). In effect, we treat the current model as part of the state space or as one of the estimated parameters. More complicated model averaging can allow moves between models with different numbers of parameters (Green, 1995). Generally these types of moves are more difficult to develop. One possible genealogy move for our case, is the move between panmictic and subdivided populations. To move from the subdivided population model to panmictic model, we simply remove the migration events and relable the leaf nodes to be in the same deme. However, the opposite move is more difficult. A number of legal migration events need to be generated over the genealogy and the leaves again relabelled. This move must be capable of producing every possible migration event pattern on the genealogy, and the probability must be calculable so the acceptance probability can be evaluated. This type of move is particularly generic in that moves between 2 to 3 demes and vice versa are also possible by first removing all the migration labels and just regenerating a legal migration history. The recolour tree move proposed in Ewing and Rodrigo (2006) does this and gives reasonable acceptance rates.
BMA provides more flexibility when compared to alternatives, for example, Bayes factors. When we wish to just have an estimate of the population size, we can marginalize the posterior distribution across models and get a combined estimate. The difference in bias between the ghost data estimated without the ghost deme, compared to estimation with a ghost deme when λ = 0 suggested that there are some distributional differences between the two cases. Thus BMA could possibly work well for this type of problem and is the next logical step for this method. We are currently working on a model averaging version of the code.
