Abstract
Abundance estimation, for both human and animal populations, informs policy decisions and population management. Capture-recapture and multiple-source data share a common structure: the population can be partially enumerated and individuals are identifiable. Consequently, the analytical methods were initially developed simultaneously. However, whilst ecological models have been developed to describe highly complex, biologically realistic scenarios, for example modeling population changes through time and combining different forms of data, multiple systems estimation has changed comparatively little. In this paper we provide a brief description of the historical development of ecological and epidemiological capture-recapture and discuss the underlying differences that have led to model divergence. We identify three key areas where ecological modeling methods may inform and improve multiple systems estimation.
Introduction
Population assessment, management, and policy decision making rely on robust and precise estimation of the size of the target population. Stigmatized, threatened, cryptic or hidden populations are particularly difficult to assess due to their hard-to-reach nature. Whilst a complete census of a population is typically too expensive and impractical to undertake, observing part of the population, a partial enumeration, may be feasible. In an ecological setting, capture-recapture methods are often applied where individuals are observed through time on different capture occasions. For human populations, multiple systems estimation (MSE) is often performed using data from a number of different sources. A key concept of MSE is the ability to identify individuals across the sources. Typically these sources correspond to different data lists and depend on the population under study. For example, data lists may include: hospital admissions, police records, and needle-exchange programmes (for injector related populations); border forces and records of non-governmental organisations (for human trafficking related populations); and humanitarian organisation records and death registries/exhumations (for war casualties). See Bird and King (2018) for further discussion and examples. Further, the data lists record individuals that are then uniquely identifiable using a combination of individual identifiers, such as name, date of birth, address, passport number, community health index (CHI) number (in the UK), or social security number (in the US). The underlying ideas in the data collection, for both capture-recapture and MSE, are the same (noting when or where individuals have been observed) and as a result the methods initially developed in parallel.
However, whilst ecological models have, for example, developed to incorporate complex structures for more realistic modeling of changes to the population through time, multiple systems estimation has continued to treat the population as a closed system, that is, one whose size does not change during the data collection period.
MSE as a method for estimating the size of human populations has a long history. The earliest known application is generally attributed to Graunt in the 1660s for estimating the population of London, with Laplace applying a similar technique to estimate the population of France in the 1780s (Goudie and Goudie, 2007). Modern applications of MSE include, for example, estimating the number of people who inject drugs (King et al., 2009, 2013, 2014), modern day slaves (Sharifi Far et al., 2020a; Silverman, 2014, 2020), homeless populations (Coumans et al., 2017), and the prevalence of human trafficking (Cruyff et al., 2017). However, many challenges remain for MSE and its ability to provide robust estimates of population sizes. For example, Cruyff et al. (2020) demonstrate the importance of model selection on population estimates and the impact of the typically sparse data sets which arise, while Sharifi Far et al. (2020a) consider the robustness of the estimates when lists are omitted or combined. Reliable prevalence estimates are important not only to assess the extent of these hidden populations, which are associated with many societal problems in addition to the impact on the individuals themselves, but also to detect trends and/or assess policy impact. Advances in ecological capture-recapture methods have led not only to increased precision of population estimates, but also to finer detail being identified, including, for example, parameters that were previously inestimable from traditional capture-recapture data. By considering some of these statistical advances within the ecological capture-recapture literature, we wish to apply a similar rationale to MSE to provide improved prevalence estimates that can better inform policy.
Brief Historical Perspective
Capture-recapture methods, motivated by applications in ecology, started to gain traction toward the end of the 19th century and into the 20th century. In particular, they were developed to estimate the size of animal populations using data from two capture occasions (Lincoln, 1930; Petersen, 1896), leading to what is typically referred to as the Lincoln-Petersen estimator. We note that the early approaches used by both Graunt and Laplace are direct applications of this technique. This was followed by the more general
MSE and Capture-Recapture Synergies
The idea underlying MSE and capture-recapture is that if the population can be sampled repeatedly, either through time (typical for ecological data) or through different sources (typical for epidemiological data), then the information on when and/or where each observed individual was seen can be used to estimate the probability that an individual is not seen. Hence, it is possible to estimate the number of missed individuals and the total population size. The number of unique observed individuals across all of the sources or occasions typically provides only a lower bound for the total population size; there may be a substantial proportion of the total population not observed by any of the sources or on any occasion.
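For the special case of two sources (or two occasions), the reasoning above reduces to the Lincoln-Petersen estimator mentioned earlier. The following is a minimal sketch only, assuming a closed population and independent sources; the function name and the counts are hypothetical and chosen purely for illustration.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-source estimate of the total population size.

    n1, n2: numbers of individuals seen by source (or occasion) 1 and 2.
    m: number of individuals seen by both.
    Assumes a closed population and independent sources.
    """
    if m <= 0:
        raise ValueError("at least one individual must be seen by both sources")
    return n1 * n2 / m


# Hypothetical counts, for illustration only.
n1, n2, m = 200, 150, 30

n_hat = lincoln_petersen(n1, n2, m)   # estimated total population size
observed = n1 + n2 - m                # unique individuals actually seen
missed = n_hat - observed             # estimated number never seen

# Equivalent view: estimate each capture probability from the overlap,
# then the probability of being missed by both sources.
p1_hat = m / n2                       # share of source-2 individuals also in source 1
p2_hat = m / n1                       # share of source-1 individuals also in source 2
p_missed = (1 - p1_hat) * (1 - p2_hat)
```

With these illustrative counts the number of unique observed individuals (320) is well below the estimated total (1000), which is the sense in which the observed count acts only as a lower bound.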
Data for MSE and capture-recapture can be expressed in the same format, through the recording of encounter histories. An example history might be 0 1 1 0 1,
indicating that this particular individual was observed by sources 2, 3, and 5 but missed by sources 1 and 4 (if considering sources of data), or that they were observed on occasions 2, 3, and 5 but missed on occasions 1 and 4 (if considering sampling through time on different occasions). In general, suppose the total population size is given by
Note that the encounter histories in the above form within the MSE setting simply record whether an individual is seen, or not, by each source within a given time period. Information on whether individuals have been seen multiple times by a source (repeat sightings) and the order in which an individual was seen by different sources is not included. In general, time information specific to each individual is not retained within the data and does not feature in current models for MSE. To record and release such information may lead to confidentiality issues where individuals could potentially be identified due to their highly unique observation data. We discuss potential options for avoiding these confidentiality issues in Section 2.2.
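The encounter-history format described above can be sketched in a few lines. This is an illustrative sketch rather than a prescribed data format: the record structure, the identifiers, and the helper name `encounter_history` are assumptions introduced here, and the 0/1 string encoding is one common convention.

```python
from collections import Counter


def encounter_history(sources_seen, n_sources):
    """Return a 0/1 string with one character per source (1 = observed)."""
    return "".join(
        "1" if s in sources_seen else "0" for s in range(1, n_sources + 1)
    )


# Hypothetical linked records: individual id -> set of sources that recorded them.
# Repeat sightings and the order of sightings are deliberately not stored.
records = {
    "a": {2, 3, 5},
    "b": {1},
    "c": {2, 3, 5},
    "d": {1, 4},
}

histories = {ind: encounter_history(seen, 5) for ind, seen in records.items()}

# Contingency table: number of individuals sharing each observed history.
table = Counter(histories.values())
```

The all-zero history never appears in the table, since individuals missed by every source are by definition unobserved; its count is the quantity the models aim to estimate.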
Methods for both MSE and capture-recapture are generally based on two different statistical distributions: a Poisson model and a multinomial model. Chao et al. (2001) provides an excellent overview of the two modeling approaches. In addition to estimating the total population size
Fienberg (1972) and Cormack (1979) defined a Poisson random variable associated with each observed encounter history. Since a set of independent Poisson random variables leads to a multinomial distribution when conditioned on their sum, Sandland and Cormack (1984) showed that both modeling approaches lead to the same maximum likelihood estimates for the parameter of interest, the total population size
Both MSE and the models described above for ecological capture-recapture assume that the population is closed. This assumes there are no arrivals or departures from the population during the period over which the data are collected or, equivalently, that the set of individuals forming the sampled population is unchanging. Under highly restrictive conditions, for example very short sampling periods, this assumption may be justifiable, but for many populations under study it is highly unlikely to hold. Data for MSE are often aggregated by year, or perhaps longer, and so the definition of the total population size can be unclear. Assuming closure implies that all individuals were available for the whole sampling duration. Perhaps a more realistic count would be those individuals that were part of the population of interest at some point during the sampling period. This latter definition requires allowing individuals to enter and leave the population at any time. Capture-recapture models commonly work within this open population framework, for example, modeling survival or retention of individuals and explicitly modeling arrivals into the population (Cormack, 1964; Jolly, 1965; King, 2014; McCrea & Morgan, 2014; Newman et al., 2014; Pledger et al., 2009; Schwarz & Arnason, 1996; Seber, 1965; Worthington et al., 2019b).
Outline of Paper
In Section 2 we explore three developments from ecological capture-recapture models that may be used to inform and improve the estimation of population size through MSE. In Section 3 we discuss whether there are elements of MSE that could benefit capture-recapture methods, in particular the combining of different dependent sources of data and consider future developments in both areas.
Ecological Advances for Potential Application to MSE
In this section we describe three developments from ecological capture-recapture models and discuss their synergies with MSE. In particular, we discuss: the assumptions relating to interactions between different sources (or capture occasions); individual heterogeneity and the closure assumption, particularly when data are collected over an extended period of time; and the combining of different forms of data within a single coherent analysis.
Temporal and Behavioral Effects
Within ecological studies the capture occasions have a natural temporal order. This is in contrast to the analogous sources used within epidemiological MSE where the sources themselves have no natural order (the encounter histories would change if the sources were reordered). For individuals recorded by multiple sources the temporal information is not available from the contingency table. The presence or absence of the temporal component (for ecological and epidemiological studies, respectively) has a direct impact on the modeling of the data and associated interpretation of the model parameters. However, there remains some commonality and interesting comparisons, motivating further useful avenues of research.
For ecological capture-recapture studies, the model is typically parameterised in terms of the (direct) probabilities of observing an animal on a given capture occasion conditional on its capture history to date (Borchers et al., 2002; McCrea & Morgan, 2014). These time-dependent capture probabilities are combined to form the associated probability of each observed encounter history (equivalently the probabilities associated with each cell of the contingency table). For example, for encounter history
In general, even when the study design is specified to minimize the variability of capture across capture occasions, the capture probabilities may still vary by occasion. This may be due to changing weather conditions, or changing behavior of the individuals over time due to breeding behavior etc. In this case of time-dependent capture probabilities, assuming that the capture probabilities are common to all individuals so that there is no additional individual heterogeneity to consider and that the capture probabilities across capture occasions are independent, we obtain the time-dependent model denoted by
In practice, it is often the case that the capture probabilities are not independent across the different occasions. In particular, the capture of an individual may influence its future capture probabilities (Otis et al., 1978). Such behavioral effects may correspond to either: (i) a “trap happy” response, where the future recapture probability of an individual is increased following its initial capture (this may occur, for example, if food is provided to captured individuals); or (ii) a “trap shy” response, where the future capture probability of an animal is decreased following its initial capture (for example, the trapping and tagging of an animal may be an unpleasant and stressful experience, and as a result the individual may identify and avoid future traps). The simplest behavioral model is denoted
We initially consider the behavioral response such that the capture of an individual influences all future capture occasions. In other words, an individual initially captured on occasion
However, there is a fundamental difference between the ecological behavioral models and the two-way interaction log-linear models with important knock-on effects and interpretations. In particular, the behavioral response in the ecological models is a “forward” or “directional” interaction only—for example, the probability of being observed at time
The comparison of log-linear models with the ecological behavioral models raises some interesting perspectives. In many cases, an individual observed by one source may be referred onwards to another source(s). For example, non-governmental organisations may pass on details of individuals to police who then also identify the same individuals when investigated further; however police may not refer individuals to non-governmental organizations. Such a process describes a directional interaction. Standard log-linear models are unable to formally model such a process (all interactions are symmetric as there is no temporal or referral information); and not incorporating these mechanisms can lead to poor performance (Jones et al., 2014). The ecological capture-recapture models thus potentially motivate the inclusion of temporal information within multiple-source data, thus permitting the development of models with directional interactions for MSE.
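The directional nature of the ecological behavioral response can be made concrete with a small sketch. It assumes the simplest behavioral parameterisation, in which an individual is caught with probability `p` until its first capture and with probability `c` on every later occasion; the symbols `p` and `c`, the function name, and the numerical values are assumptions introduced here for illustration and are not taken from the text.

```python
from itertools import product


def history_probability(history: str, p: float, c: float) -> float:
    """Probability of a 0/1 encounter history under a simple behavioral model.

    p: capture probability for an individual not yet captured.
    c: recapture probability on every occasion after the first capture.
    The effect is directional: a capture only alters later occasions.
    """
    prob = 1.0
    caught_before = False
    for h in history:
        q = c if caught_before else p
        prob *= q if h == "1" else 1.0 - q
        if h == "1":
            caught_before = True
    return prob


# Hypothetical values: c > p corresponds to a "trap happy" response.
prob_behavioral = history_probability("01101", p=0.2, c=0.5)

# Setting c = p removes the behavioral effect (constant capture probability).
prob_no_effect = history_probability("01101", p=0.2, c=0.2)

# Sanity check: probabilities over all 2^5 histories sum to one.
total = sum(
    history_probability("".join(h), p=0.2, c=0.5) for h in product("01", repeat=5)
)
```

Reordering the occasions changes the result whenever `c` differs from `p`, which is exactly what a symmetric two-way interaction in a log-linear model cannot express. A referral from one source to another in MSE would play the role of the first capture here, provided the order of observation were recorded.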
Open Population Models
The models discussed in the previous sections assume that the population being estimated is closed. The estimate of the total population is therefore a “snapshot” estimate assuming that individuals did not leave the population (due to death, migration or no longer being a member of the target population) nor did new individuals join the population (birth, migration or becoming a member of the target population). Whilst policy makers appear to prefer “snapshot” estimates, the estimation of the population size through time may be more informative by identifying changes occurring within the population.
For example, suppose a contingency table summarises the data collected over a 2-year period by multiple sources. The traditional MSE estimate for
Many of the standard open population capture-recapture models, in addition to estimating capture probabilities, estimate apparent survival, or retention probabilities. These parameters express the probability that an individual currently in the population on occasion
If similar multi-state models were to be applied in an MSE setting, then time information would be required. The progression of states from not in the population, to joining the population, to leaving the population, occur in a natural order; it is simply the timing of the transitions that is uncertain. The extra information that would be required could however lead to more informative investigation of the population. For instance, if the arrival and departure time of individuals can be estimated, then the amount of time individuals spend in the population can also be estimated and time spent in the population could inform the probability of capture by a source. If the states refer to the sources that have captured an individual then the transitions between states could model resighting at a source that has already recorded the individual or capture by a further source. This could open up possibilities to identify the expected time gap between sources and potential referrals between sources.
The data required for time-dependent modeling in an MSE setting may be difficult to obtain. To model transitions between sources it is possible that very large datasets would be required in order to obtain a sufficient number of observations of the different orderings of sources—a problem that would increase significantly with the number of sources used. The largest issue will be in protecting individual identities. By simply retaining the sources that have observed an individual there is a reasonable degree of anonymity. Unless there are very small cell counts individuals will not be identifiable. However, if highly specific covariate information were collected, such as the time of observation by a source, then there is the potential for individuals to be identified. This may be mitigated by instead assigning an arbitrary “time 0” and recording the time gap between observations. The models described here operate in discrete time, and so further anonymity may be achieved by careful selection of the discretisation, though again large datasets may be required in order to have several individuals identified in any one discrete time period.
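The mitigation described above, recording time gaps from an arbitrary “time 0” and then discretising, might look as follows. This is a sketch under stated assumptions: each individual's first observation is taken as its time 0, the bin width of 30 days is arbitrary, and the source labels, dates, and function name are hypothetical.

```python
from datetime import date


def anonymised_gaps(observations, bin_days=30):
    """Replace calendar dates by discretised gaps since the first observation.

    observations: list of (source, date) pairs for one individual.
    Returns (source, bin index) pairs in time order, where bin 0 contains
    the individual's own first observation (their "time 0").
    """
    ordered = sorted(observations, key=lambda obs: obs[1])
    t0 = ordered[0][1]
    return [(source, (when - t0).days // bin_days) for source, when in ordered]


# Hypothetical observations of a single individual.
obs = [
    ("police", date(2020, 3, 15)),
    ("ngo", date(2020, 1, 10)),
    ("hospital", date(2020, 1, 25)),
]

gaps = anonymised_gaps(obs)
```

The output retains the order of sources and the approximate spacing between sightings while discarding the calendar dates themselves. A wider bin gives more anonymity at the cost of temporal resolution, which mirrors the trade-off noted in the text.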
Integrated Modeling
Integrated population modeling in ecology refers to the combined analysis of multiple data types. The concept was first proposed in Besbeas et al. (2002), where ring-recovery data modeled using a product multinomial likelihood were analysed in conjunction with population counts (or population index data) described using a state-space model. This was the first time two disparate modeling approaches were unified into a single analysis. The global model describing both types of data simultaneously requires the assumption of independence of the data, as the global likelihood function is formed as the product of component likelihoods. Although some concern has been raised regarding the validity of this assumption, it has been found that its violation does not result in appreciable bias in the estimators (see, for example, Abadi et al. (2010) and Besbeas et al. (2009)). One of the benefits of analysing disparate data sets simultaneously is improved precision for some parameters. This is particularly noticeable in the case of multi-state models, where estimates of transition probabilities are often associated with large uncertainty, and the addition of state-specific population counts modeled using the state-space framework improves the precision (McCrea et al., 2010).
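The construction of a global likelihood as a product of component likelihoods can be illustrated with a deliberately simple toy. It is not the product-multinomial and state-space model of Besbeas et al. (2002): here two hypothetical, independent binomial data sets are assumed to share a single parameter `p`, so the joint log-likelihood is the sum of the two components. All names and counts are assumptions made for illustration.

```python
import math


def binomial_loglik(k, n, p):
    """Binomial log-likelihood in p, omitting the constant binomial coefficient."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def joint_loglik(p, datasets):
    """Independence assumption: the product of likelihoods is a sum of logs."""
    return sum(binomial_loglik(k, n, p) for k, n in datasets)


def grid_mle(datasets, steps=999):
    """Maximise the joint log-likelihood over a regular grid on (0, 1)."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda p: joint_loglik(p, datasets))


def approx_se(p, n):
    """Large-sample standard error of a binomial proportion."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical data sets as (successes, trials), both informing the same p.
data_a = (30, 50)
data_b = (130, 200)

p_single = grid_mle([data_a])
p_joint = grid_mle([data_a, data_b])

se_single = approx_se(p_single, 50)
se_joint = approx_se(p_joint, 50 + 200)
```

In this toy the joint analysis gives a smaller approximate standard error than the single data set, which is the precision gain referred to above. A real integrated model would combine likelihoods of different forms, and only some parameters would be shared between them.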
It is also the case that certain parameters cannot be estimated from a single source of data due to parameter redundancy (Cole & McCrea, 2016; Sharifi Far et al., 2020b; Vincent et al., 2020). For example, if only census counts are available, fecundity (or productivity) and first-year survival cannot be estimated separately. By analysing the census data in conjunction with another source of data, such as ring-recovery data which contain information on survival, it is then possible to estimate both survival and fecundity.
Within the MSE models there are two parameters: the unknown population size and the capture probabilities. Therefore, if other sources of data provide additional information on capture probability, this will result in better estimates of both parameters (due to the correlation of the parameters, improvement in precision of capture probability will result in improved precision of
Conclusions and Further Directions
In this paper we have discussed similarities between ecological capture-recapture studies and epidemiological MSE, and focused on three key areas in which capture-recapture methods may inform and improve MSE analyses. Whilst sharing a similar model structure, both capture-recapture and MSE can generally be criticized for the assumptions they make. Broadly speaking, MSE ignores temporal information and assumes a static population size, whilst capture-recapture ignores potential dependence of observations.
Capture-recapture methods including temporal effects or open population models offer an opportunity for more realistic modeling of the population being counted through time. The incorporation of time into MSE analyses would require a different data structure to the summarized contingency tables that are currently typical. Individual specific information would need to be retained and the issues surrounding the protection of identity would need careful consideration. The benefits of the increased understanding of the dynamics of the population however may be significant.
An advantage of MSE analyses that may be relevant to capture-recapture is the treatment of dependence between the sources of data. This dependence is readily accounted for within the model using two-way, or higher, interactions. The interpretation of these interaction terms is not as readily understood in the case of multiple sampling occasions through time. However, surveying of animals can take different forms, of which capture-recapture data is only one. It can be advantageous to include multiple forms of information within a single analysis, fitting the model within a single framework. Integrated modeling includes an assumption of independence between the sources, but an approach where this can be relaxed may be preferable, and MSE could inform this approach. A potential application is the analysis of migration data. If there are multiple sites that a species may attend, the choice of site, or the sites at which an individual is seen, may be influenced by the combinations of sites themselves. Including dependence between the sources, in this case sites, may allow, for instance, the modeling of related increases/decreases in sightings between sites (similar sites influencing each other, for example).
Many capture-recapture studies are repeated annually, with data then available across multiple years. Models exist that consider not just a single year of data but instead operate on two time scales: a primary scale operating across several years, and a secondary scale operating within a single year. Many MSE analyses involve data collected over several years, aggregated by year. There may be scope for these capture-recapture models to be applied to MSE data, where individuals could be tracked through years as well as across sources. Capture-recapture data, in addition to time, may also include spatial information on the location where a capture occurred. Links between the non-independence of the sources in MSE and the spatial density of a species might be an interesting avenue for further consideration.
There is clearly potential for the two academic communities, ecological statistics and MSE, to collaborate to make the most of the information contained in their respective data sets.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.