Abstract

If part of a population is hidden but two or more samples are available that each cover parts of this population, multiple systems estimation can be applied to estimate the size of this population. A problem is that these estimates suffer from finite-sample bias that can be substantial in case of a small sample or a small population size. This problem was recognized by Chapman, who derived his essentially unbiased Chapman-estimator for two samples. Because more than two samples may be required to correct for sample dependence, we propose a Generalized Chapman-estimator that can be applied with any number of samples. In a Monte Carlo experiment, this new estimator shows hardly any bias and has smaller standard errors than competing bias-reduced estimators. It is also compared to the usual maximum likelihood estimates for the case of estimating the number of homeless people in the Netherlands, where it shows notably different outcomes.

Keywords

finite-sample bias, log-linear model, multiple systems estimation, Chapman-estimator

1. Introduction

Estimating the size of a closed population, which is partly observed through a set of incomplete samples that each cover parts of this population, is an important statistical problem. The estimation method for this problem is known as multiple systems estimation (MSE), but also as capture-recapture, mark-recapture, or multiple-recapture estimation. An overview of its history and applications is provided by, for example, Bird and King (2018), Cormack (1989), and International Working Group for Disease Monitoring and Forecasting (1995). The most basic MSE estimator was proposed by Petersen (1896) and later Lincoln (1930) and is therefore known as the Lincoln-Petersen (LP) estimator. The LP-estimator is based on two random population samples, and under a set of assumptions discussed in Wolter (1986), it provides an asymptotically unbiased estimator for the size of the population.

Two of the main assumptions are sample independence and homogeneous inclusion probabilities of population units. When these assumptions are violated, the LP-estimator is biased. A standard approach that aims to correct for this bias is the incorporation of additional samples and covariates in the MSE model (see, e.g., Bishop et al. 1975). Fienberg (1972) showed that the log-linear model provides an easy and flexible way of incorporating multiple samples, covariates, and relations between them into MSE. The standard approach to estimating the parameters of a log-linear model is maximum likelihood (ML). A well-known problem of ML estimates of log-linear model parameters is their bias in finite samples (see e.g., Hald 1952, chap. 7, or Miller 1984), which can be substantial in case of small samples (Long 1997, 53–4; Rainey and McCaskey 2021). We should note that this “bias” refers to mean-bias and not median-bias, because median-unbiasedness is not affected by non-linear transformations (see, e.g., Hald 1952; Kosmidis et al. 2020, for further discussion). In fact, the reduction of mean-bias often leads to the introduction of median-bias, so there is a trade-off involved. However, as discussed by, for example, Kosmidis and Firth (2021) and Rainey and McCaskey (2021), another advantage of mean-bias reduction is that it also reduces the variance and therefore improves the accuracy of the ML-estimator, so it is generally desirable.

For the LP-estimator, bias was already discussed by Chapman (1951) and Bailey (1951), who each proposed their own alternative estimator with improved finite-sample properties. Chapman (1951, 147) showed that his estimator is “essentially unbiased” and so his Chapman-estimator became the standard bias-corrected version of the LP-estimator. A problem with both the Chapman- and Bailey-estimator is that they are designed only for two samples, so it is unclear how they can be applied when there are more than two samples. This is important because, to paraphrase Tilling (2001), bias in MSE estimates is more likely if one sample (or a combination of samples and/or categorical covariates) contains very few records, which becomes more likely when more samples (or covariates) are involved. Chapman (1952) himself extended his estimator to more than two samples, but only for the case where a unit was tagged in an earlier sample or not, and therefore it cannot be used with a log-linear model that contains sample dependencies, which requires the availability of complete inclusion patterns. Evans and Bonett (1994) and Rivest and Lévesque (2001) propose bias reduction methods specifically for MSE that are, as in Firth (1993), Kosmidis and Firth (2011), and Kosmidis et al. (2020), based on modified score-functions. They propose modification schemes for the log-linear models by Otis et al. (1978), which correspond to a selection of log-linear model specifications (Chao 2001).

In this article, we derive a generalization of the Chapman-estimator, which can be applied with any log-linear model specification. For two samples our Generalized Chapman-estimator is, just like the estimator by Rivest and Lévesque (2001), equivalent to the Chapman-estimator and for more than two samples it differs from the estimator that is proposed by Rivest and Lévesque. To derive this new estimator, the next section introduces some notation and discusses the relation between MSE and the log-linear model, the problem of bias, and bias correction.

2. Multiple Systems Estimation Theory

MSE considers a closed population of size $N$ and a sequence of $k > 1$ incomplete samples, indicated by $A, B, C, \dots, K$ , that each partly cover this population. The members in each sample are uniquely labelled, the size of samples is not fixed in advance and samples may be mutually dependent. Together these samples identify $n < N$ unique members of the population. For ease of notation, where possible, the case of two or three samples is discussed, because it can often be generalized to any number of samples in a straightforward way. The probabilities that a member of the population is included in sample $A$ and in sample $B$ are denoted as $p_{a}$ and $p_{b}$ , respectively. The unique labelling gives each member of the population an inclusion history over samples, which can be denoted as a binary string of length $k$ . For two samples this is $ab$ , with $a, b \in \{0, 1\}$ , where $a = 1$ means that the individual is in sample $A$ and $a = 0$ means that the individual is not; $a = +$ indicates the sum over $a = 0$ and $a = 1$ . The same holds for $b$ . The count of members with inclusion history $ab$ is denoted as $n_{ab}$ . Then, for example, $n_{1 +}$ is the size of sample $A$ and $n_{11}$ is the number of members that are present in both samples. With some abuse of notation, $n$ also denotes the $(2^{k} - 1) \times 1$ vector that contains the complete set of observed counts. For two samples this is $n = {(n_{11}, n_{10}, n_{01})}^{⊤}$ . The total population size $N$ is equal to $n + n_{00}$ , where the scalar $n$ equals the sum $\sum_{ab} n_{ab}$ over the observed counts and $n_{00}$ is the unobserved part of the population. Finally, MSE assumes that $n_{ab}$ follows a multinomial distribution (Darroch 1958) with expectations $m_{ab}$ , and the aim is to estimate $m_{00}$ and $N$ .

2.1. The Saturated Log-Linear Model

Fienberg (1972) showed that MSE can be described by a log-linear model, that for two samples is written as

\log m_{ab} = λ + λ_{a}^{A} + λ_{b}^{B} + λ_{ab}^{AB},

(1)

where $λ$ is an intercept term, $λ_{a}^{A}$ and $λ_{b}^{B}$ are inclusion parameters for sample $A$ and $B$ , and $λ_{ab}^{AB}$ is a two-way interaction parameter for sample $A$ and $B$ . A $λ$ -parameter is equal to zero when either $a$ and/or $b$ in the subscript is $0$ . Furthermore, for model identification it is generally assumed that $λ_{ab}^{AB} = 0$ (i.e., sample independence), which implies that Equation (1) can be reduced to $m_{00} = m_{10} m_{01} / m_{11}$ . The conditional ML estimates (see e.g., Sanathanan 1972) for $m_{ab}$ are the corresponding $n_{ab}$ , so conditional ML gives the well-known Lincoln-Petersen estimator

{\hat{m}}_{00}^{LP} = n_{10} n_{01} / n_{11} .

(2)

Asymptotic consistency of the LP-estimator relies on the assumption of independence between sample $A$ and $B$ , which may be unrealistic in many applications. A solution is to use three (or more) samples, because then the independence assumption $λ_{ab}^{AB} = 0$ is no longer required for identification. The log-linear model for three samples is

\log m_{abc} = λ + λ_{a}^{A} + λ_{b}^{B} + λ_{c}^{C} + λ_{ab}^{AB} + λ_{ac}^{AC} + λ_{bc}^{BC} + λ_{abc}^{ABC} .

(3)

In this case of three samples, for identification it is generally assumed that the three-way interaction parameter $λ_{abc}^{ABC} = 0$ , so the two-way interaction parameters $λ_{ab}^{AB}$ , $λ_{ac}^{AC}$ , and $λ_{bc}^{BC}$ can be estimated. In general, for $k$ samples, a saturated model is obtained by assuming that the $k$ -way interaction parameter is equal to zero. Fienberg (1972) shows that for a saturated model with three samples the conditional ML-estimator for $m_{000}$ is

{\hat{m}}_{000}^{SAT, ML} = n_{111} n_{100} n_{010} n_{001} / (n_{110} n_{101} n_{011}),

(3)

and for $k$ samples

{\hat{m}}_{00 \dots 0}^{SAT, ML} = Π n_{odd} / Π n_{even},

(4)

where $n_{odd}$ is a $(2^{k - 1}) \times 1$ vector and $n_{even}$ is a $((2^{k - 1}) - 1) \times 1$ vector with $n = \{n_{odd}, n_{even}\}$ , with $n_{odd}$ and $n_{even}$ representing the sets in $n$ with inclusion histories with an odd and even number of inclusions respectively.

2.2. The Chapman-Estimator and a Generalization for Saturated Models

The conditional ML-estimators discussed in the previous section suffer from finite-sample bias (see e.g., Hald 1952, chap. 7, or Miller 1984). A straightforward example is provided by the LP-estimator in Equation (2), which has no finite expectation because there is a non-zero probability of $n_{11} = 0$ . This bias in the LP-estimator was discussed by Chapman (1951) who derived a Chapman-estimator for $m_{00}$ and $N$ , that is,

{\hat{m}}_{00}^{Chap} = n_{10} n_{01} / (n_{11} + 1), {\hat{N}}^{Chap} = n + {\hat{m}}_{00}^{Chap} .

(5)
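As a minimal sketch, Equations (2) and (5) can be written as two small functions. The function names and the counts below are hypothetical and serve only as an illustration; returning infinity for $n_{11} = 0$ is one possible convention for the LP-estimator.

```python
def lincoln_petersen(n11, n10, n01):
    """Equation (2): LP estimate of the unobserved count m_00."""
    if n11 == 0:
        return float("inf")  # no overlap between the two samples
    return n10 * n01 / n11


def chapman(n11, n10, n01):
    """Equation (5): Chapman estimate of m_00, finite even when n11 == 0."""
    return n10 * n01 / (n11 + 1)


# Hypothetical counts, chosen only for illustration.
n11, n10, n01 = 9, 41, 11
n = n11 + n10 + n01                      # number of observed units
N_lp = n + lincoln_petersen(n11, n10, n01)
N_chap = n + chapman(n11, n10, n01)
print(round(N_lp, 1), round(N_chap, 1))
```

The Chapman estimate is never larger than the LP estimate, because it differs only by the added 1 in the denominator.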

Chapman (1951, 146) shows that if the two samples are large enough compared to $N$ , the bias in ${\hat{N}}^{Chap}$ is less than $ϵ$ (Cramer 1922, 502) percent of $N$ and so Chapman’s estimator is “essentially unbiased.” This is further illustrated by Rivest et al. (1995) who show that the bias in the Chapman-estimator can be approximated by

E [{\hat{N}}^{Chap} - N] \approx E [n] \frac{(1 - p_{a}) (1 - p_{b})}{1 - (1 - p_{a}) (1 - p_{b})} {(1 - \frac{p_{a} p_{b}}{1 - (1 - p_{a}) (1 - p_{b})})}^{E [n] - 1},

(6)

which converges to zero for large enough $E [n]$ , $p_{a}$ , and $p_{b}$ .

One reason why Chapman’s derivation is powerful is that Chapman uses the inverse factorial approximation, which for the function $E [1 / n_{11}]$ is recommended by Stephan (1945), to approximate the bias expression. More common is to use a Taylor expansion, as was done by Bailey (1951), who as a result proposes his own slightly different bias-reduced estimator for two samples, that is, ${\hat{m}}_{00}^{Bailey} = n_{10} (n_{01} - 1) / (n_{11} + 1)$ . In this case the inverse factorial approximation has the advantage that it requires fewer expansion terms to achieve convergence and is therefore more accurate than the Taylor series expansion (see the Supplemental Material for an illustration). The superiority of the Chapman-estimator over the Bailey-estimator is illustrated in the simulation study presented in Subsection 3.1.

The Chapman-estimator can be derived by assuming that the set $(n_{11}, n_{10}, n_{01}, n_{00})$ follows a multinomial distribution but also by assuming that each $n_{ab}$ follows a Poisson distribution (proof available as Supplemental Material). The Poisson distribution is useful because it simplifies extending the Chapman-estimator toward more than two samples: when the Chapman-estimator is asymptotically unbiased under Poisson sampling, this implies $m_{ab} = E [n_{ab}]$ and

1 / m_{ab} \approx E [1 / (n_{ab} + 1)] .

(7)

Under Poisson sampling, combining Equation (7) with Equation (4) gives a new bias-corrected estimator for $m_{00 \dots 0}$ in case of any number of samples and a saturated log-linear model, that is,

{\hat{m}}_{00 \dots 0}^{SAT, Chap - k} = Π n_{odd} / Π (n_{even} + 1)

(8)

For the saturated model with three samples as defined in Equation (3), Equation (8) gives

{\hat{m}}_{000}^{SAT, Chap - 3} = n_{111} n_{100} n_{010} n_{001} / ((n_{110} + 1) (n_{101} + 1) (n_{011} + 1)) .

(9)

This estimator can also be obtained with the modified-score function approach, by replacing $n$ with the modified

n^{SAT, Chap - 3} = {(n_{111}, n_{110} + 1, n_{101} + 1, n_{011} + 1, n_{001}, n_{010}, n_{100})}^{⊤} .
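The saturated estimators in Equations (4) and (8) can be sketched for any number of samples as follows. The function name, the dictionary layout with inclusion histories as keys, and the counts are assumptions made for this illustration.

```python
from itertools import product
from math import prod


def saturated_m0(counts, k, chapman=True):
    """Equations (4) and (8): estimate of the all-zero cell for k samples.

    counts maps inclusion histories such as "101" to observed counts.
    With chapman=False a zero count in the denominator raises an error.
    """
    numerator, denominator = [], []
    for bits in product("01", repeat=k):
        history = "".join(bits)
        inclusions = history.count("1")
        if inclusions == 0:
            continue  # the unobserved cell
        if inclusions % 2 == 1:
            numerator.append(counts[history])
        else:
            denominator.append(counts[history] + (1 if chapman else 0))
    return prod(numerator) / prod(denominator)


# Hypothetical three-sample counts, chosen only for illustration.
counts3 = {"111": 4, "110": 7, "101": 5, "011": 3,
           "100": 30, "010": 20, "001": 25}
print(saturated_m0(counts3, 3, chapman=False))  # Equation (4)
print(saturated_m0(counts3, 3))                 # Equations (8) and (9)
```

For $k = 2$ the odd cells are $n_{10}$ and $n_{01}$ and the only even cell is $n_{11}$, so the same function returns the Chapman-estimator of Equation (5).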

The idea of adding 1 to each variable in the denominator was also suggested in a slightly different context by Jewell (1986), who was interested in an estimate of the odds ratio of a $2 \times 2$ table, with the four elements resulting from a multinomial distribution. Jewell concludes that $ad / ((b + 1) (c + 1))$ (with $a, b, c, d$ the four elements of the $2 \times 2$ table) is a preferable estimator for the odds ratio with small samples.

The Generalized Chapman-estimator in Equation (8) is a bias-corrected estimator for saturated models. The next section shows that if an unsaturated model is chosen, this leads to a different optimal modification scheme and therefore a different bias-corrected estimator.

2.3. Unsaturated Log-Linear Model Specifications

A log-linear model (LLM) for any number of samples and parameters can be written as

\log (m) = X^{LLM} λ^{LLM},

(10)

with $m$ a $(2^{k} - 1) \times 1$ vector of expectations that correspond to the $n_{ab \dots k}$ in $n$ . $X^{LLM}$ is a $(2^{k} - 1) \times l$ design matrix and $λ^{LLM}$ a $l \times 1$ vector with $l$ log-linear model $λ$ -parameters. How $X^{LLM}$ and $λ^{LLM}$ are specified is indicated by the superscript LLM. When Equation (10) constitutes a saturated model, it contains $l = 2^{k} - 1$ parameters, but the saturated model is not the only model that may be chosen. For instance, when the samples $A$ and $C$ are conditionally independent, the parameter $λ_{ac}^{AC}$ may be left out of the model. This gives an unsaturated log-linear model that has the advantage that a resulting estimator has reduced variance (Fienberg 1972, 598). The question is whether and how this affects the optimal bias-correction modification scheme.

Fienberg (1972) discusses three typical unsaturated models for the case of three samples, which are illustrative for our purpose. We refer to them as the two-pair dependence (2PD), the one-pair dependence (1PD), and independence (IND) model, which are

2 PD : \log m_{abc} = λ + λ_{a}^{A} + λ_{b}^{B} + λ_{c}^{C} + λ_{ab}^{AB} + λ_{bc}^{BC},

(11)

1 PD : \log m_{abc} = λ + λ_{a}^{A} + λ_{b}^{B} + λ_{c}^{C} + λ_{ab}^{AB},

(12)

IND : \log m_{abc} = λ + λ_{a}^{A} + λ_{b}^{B} + λ_{c}^{C} .

(13)

The 2PD and 1PD model may also be formulated with different pairwise interaction terms, but because we can change the order of samples without loss of generality, we can limit ourselves to these three models. For the 2PD and 1PD model, Fienberg (1972, 596) derives closed form expressions of the conditional ML-estimators for $m_{000}$ , which are

{\hat{m}}_{000}^{2 PD, ML} = n_{100} n_{001} / n_{101},

(14)

{\hat{m}}_{000}^{1 PD, ML} = n_{001} n_{+ + 0} / (n_{111} + n_{101} + n_{011}) .

(15)

Fienberg (1972) further shows that for the independence model a closed form expression of ${\hat{m}}_{000}^{IND, ML}$ does not exist.

The optimal modification schemes for the estimators in Equations (14) and (15) can be easily derived when Equation (7) is considered. $n_{ab}$ in Equation (7) represents any Poisson distributed variable, so it can be replaced with the Poisson variables $n_{101}$ or $n_{111} + n_{101} + n_{011}$ , which for Equations (14) and (15) gives

{\hat{m}}_{000}^{2 PD, Chap - 3} = n_{001} n_{100} / (n_{101} + 1),

(16)

{\hat{m}}_{000}^{1 PD, Chap - 3} = n_{001} n_{+ + 0} / (n_{111} + n_{101} + n_{011} + 1)

(17)

as bias-corrected estimators for the 2PD and 1PD model. This shows that in case of the 2PD model the optimal modification scheme for the saturated model is overdone but not incorrect, because $n_{101}$ is modified in the same way and modifying $n_{110}$ and $n_{011}$ simply does not affect the 2PD model ML-estimator. For the 1PD model however, the optimal modification scheme as compared to the saturated model has changed, because the optimal modification scheme for the saturated model would have given

{\hat{m}}_{000}^{1 PD} = n_{001} (n_{+ + 0} + 1) / (n_{111} + n_{101} + n_{011} + 2)

as a bias-corrected estimator, which differs from the optimal ${\hat{m}}_{000}^{1 PD, Chap - 3}$ in Equation (17).
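The closed form estimators of this section can be compared numerically with a short sketch. The function names and counts are hypothetical, and $n_{+ + 0}$ is taken here to be the sum of the observed cells with $c = 0$, that is, $n_{110} + n_{100} + n_{010}$.

```python
def m000_2pd(n, chapman=True):
    """Equations (14) and (16) for the 2PD model."""
    return n["100"] * n["001"] / (n["101"] + (1 if chapman else 0))


def m000_1pd(n, chapman=True):
    """Equations (15) and (17) for the 1PD model."""
    n_pp0 = n["110"] + n["100"] + n["010"]      # observed cells with c = 0
    denom = n["111"] + n["101"] + n["011"]
    return n["001"] * n_pp0 / (denom + (1 if chapman else 0))


def m000_1pd_saturated_scheme(n):
    """1PD ML-estimator on counts modified with the saturated scheme,
    which adds 1 to n_110, n_101 and n_011."""
    n_pp0 = (n["110"] + 1) + n["100"] + n["010"]
    denom = n["111"] + (n["101"] + 1) + (n["011"] + 1)
    return n["001"] * n_pp0 / denom


# Hypothetical counts, chosen only for illustration.
n = {"111": 4, "110": 7, "101": 5, "011": 3, "100": 30, "010": 20, "001": 25}
print(m000_2pd(n, chapman=False), m000_2pd(n))
print(m000_1pd(n, chapman=False), m000_1pd(n), m000_1pd_saturated_scheme(n))
```

With these counts the saturated scheme gives a different 1PD estimate than Equation (17), which illustrates why the modification scheme has to follow the chosen model.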

The derivation of the bias-corrected estimators in Equations (8), (16), and (17) requires closed form expressions of the ML-estimators for $m_{00 \dots 0}$ and $m_{000}$ for the chosen log-linear models. The fact that for the IND model in Equation (13) such a closed form expression does not exist, shows that this approach cannot be generalized toward any log-linear model specification. Therefore, in the next section we propose a different derivation that does not depend on closed form expressions, also leads to the bias-corrected estimators in Equations (8), (16), and (17), and can be easily extended toward unsaturated models with more than three samples.

2.4. The Generalized Chapman-Estimator

The previous section showed how a closed form expression of the ML-estimator for $m_{000}$ can be used to derive an optimal modification scheme. However, for an optimal modification scheme it is enough to know which (function of) elements of $n$ are in the denominator of this closed form expression. For example, to obtain an optimal modification scheme for the 1PD model, it is enough to know that $n_{111} + n_{101} + n_{011}$ is in the denominator of the closed form expression. Therefore, an alternative is to obtain an estimator for $m_{000}$ with the least squares (LS) approach. This is useful because, as Frome et al. (1973) show, the ML-estimator can be formulated as a (properly weighted) LS-estimator, so the LS-estimator has a structure that is similar to the structure of the ML-estimator. This can be seen in the general expression for the LS-estimator for $m_{00 \dots 0}$ , which is

{\hat{m}}_{00 \dots 0}^{LLM, LS} = \underset{ab \dots k}{Π} {(n_{ab \dots k})}^{z^{LLM}},

where $z^{LLM}$ is the first row of the well-known Moore-Penrose inverse (Moore 1920; Penrose 1955) denoted as

Z^{LLM} = {({(X^{LLM})}^{⊤} X^{LLM})}^{- 1} {(X^{LLM})}^{⊤} .

Because the rank of $X^{LLM}$ and therefore $Z^{LLM}$ is smaller than or equal to $2^{k} - 1$ , it will always provide a closed form expression of the LS-estimator for ${\hat{m}}_{00 \dots 0}$ . The relation between the LS- and ML-estimator implies that $z^{LLM}$ indicates which elements of $n_{ab \dots k}$ are in the numerator and denominator of a closed form expression of the ML-estimator, given that it exists. For example, for the 1PD model the LS-estimator is

{\hat{m}}_{000}^{1 PD, LS} = \frac{n_{001} {(n_{110} n_{100} n_{010})}^{1 / 3}}{{(n_{111} n_{101} n_{011})}^{1 / 3}} .

This LS-estimator resembles the ML-estimator for the 1PD model as given in Equation (15), in the sense that they are both fractions of means over the same elements of $n$ . The difference is that the LS-estimator contains geometric means and the ML-estimator contains arithmetic means. This relation also holds for the 2PD model and the saturated model with any number of samples. We suspect that this relation holds for all unsaturated models for which a closed form expression of the ML-estimator for ${\hat{m}}_{00 \dots 0}$ exists, but a formal proof is beyond the scope of this paper.
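The difference between geometric and arithmetic means can be made concrete with a small sketch for the 1PD model. The function names and both sets of counts are hypothetical. When the counts within the numerator group and within the denominator group are equal, the two kinds of means coincide and so do the two estimators.

```python
def m000_1pd_ls(n):
    """LS-estimator for the 1PD model: ratio of geometric means."""
    up = (n["110"] * n["100"] * n["010"]) ** (1 / 3)
    down = (n["111"] * n["101"] * n["011"]) ** (1 / 3)
    return n["001"] * up / down


def m000_1pd_ml(n):
    """Equation (15): ratio of arithmetic means (the factors 1/3 cancel)."""
    up = n["110"] + n["100"] + n["010"]
    down = n["111"] + n["101"] + n["011"]
    return n["001"] * up / down


# Hypothetical counts: equal within each group, and unequal.
balanced = {"111": 4, "101": 4, "011": 4, "110": 9, "100": 9, "010": 9, "001": 25}
unbalanced = {"111": 4, "101": 5, "011": 3, "110": 7, "100": 30, "010": 20, "001": 25}
print(m000_1pd_ls(balanced), m000_1pd_ml(balanced))
print(m000_1pd_ls(unbalanced), m000_1pd_ml(unbalanced))
```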

The relation between the LS- and ML-estimator for the SAT, 2PD, and 1PD model can, by using $z^{LLM}$ , be further extended toward the corresponding bias-corrected estimators in Equations (8), (16), and (17), which suggests that $z^{LLM}$ may contain the solution for an optimal modification scheme. First note that a negative value in $z^{LLM}$ indicates that the corresponding $n_{ab \dots k}$ is in the denominator of the LS-estimator. Furthermore, the negative values in $z^{LLM}$ do not only indicate which $n_{ab \dots k}$ are in the denominator, but they also indicate their weight. For example, for the 1PD model, $z^{LLM}$ contains three times a $- 1 / 3$ in the cells that correspond to $n_{111}$ , $n_{101}$ , and $n_{011}$ (see Table 1), while they each have a weight of one third in the denominator of the closed form expression of the ML-estimator. This means that subtracting $- 1 / 3$ from these $n_{abc}$ gives the Generalized Chapman-estimator for the 1PD model as given in Equation (17). A general expression for this modification is

Table 1.

The Value of $z^{LLM}$ and $ω^{LLM}$ for Each LLM.

(a) $m$	$z^{SAT}$	$z^{2 PD}$	$z^{1 PD}$	$z^{IND}$
$m_{111}$	1	0	−1/3	−1/2
$m_{110}$	−1	0	1/3	0
$m_{101}$	−1	−1	−1/3	0
$m_{011}$	−1	0	−1/3	0
$m_{100}$	1	1	1/3	1/2
$m_{010}$	1	0	1/3	1/2
$m_{001}$	1	1	1	1/2
(b) $m$	$ω^{SAT}$	$ω^{2 PD}$	$ω^{1 PD}$	$ω^{IND}$
$m_{111}$	0	0	1/3	1/2
$m_{110}$	1	0	0	0
$m_{101}$	1	1	1/3	0
$m_{011}$	1	0	1/3	0
$m_{100}$	0	0	0	0
$m_{010}$	0	0	0	0
$m_{001}$	0	0	0	0

n^{LLM, Chap - k} = n + ω^{LLM},

(18)

where $ω^{LLM}$ is equal to $(- z^{LLM})$ with all elements that are larger than zero in $z^{LLM}$ , set to zero (see Table 1b for examples). When $n$ is replaced by $n^{LLM, Chap - k}$ before estimation, the Generalized Chapman-estimator is obtained by

{\hat{m}}_{00 \dots 0}^{LLM, Chap - k} = \exp ({\hat{λ}}^{LLM, Chap - k}),

(19)

with ${\hat{λ}}^{LLM, Chap - k}$ the ML-estimate given the modified $n^{LLM, Chap - k}$ for the intercept term $λ$ . To illustrate, for the models SAT, 2PD, 1PD, and IND in Equations (3), (11), (12), and (13), the $z^{LLM}$ and $ω^{LLM}$ are given in Table 1. If the modification schemes $ω^{SAT}$ , $ω^{2 PD}$ , and $ω^{1 PD}$ in Table 1b are applied to $n$ according to Equation (18), Equation (19) gives the bias-corrected Chapman-estimators in Equations (9), (16), and (17) and for any saturated log-linear model this procedure leads to Equation (8). Finally, the column $ω^{IND}$ also gives a modification scheme for the IND model, for which it cannot be mathematically confirmed whether it leads to an optimal bias-corrected estimator, because a closed form expression of ${\hat{m}}_{000}^{IND, ML}$ does not exist. Therefore the performance of this modification scheme will be further investigated in a simulation study in Subsection 3.2.2.
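The values in Table 1 can be reproduced with exact rational arithmetic. The sketch below assumes dummy coding of the design matrix (an intercept, one indicator per sample, and products of indicators for the interaction terms) and the row order of Table 1. The function names and the small linear solver are illustrative choices, not part of the original derivation.

```python
from fractions import Fraction

CELLS = ["111", "110", "101", "011", "100", "010", "001"]  # row order of Table 1

# Pairwise interaction terms per model; main effects A, B, C are always included.
MODELS = {
    "SAT": [(0, 1), (0, 2), (1, 2)],
    "2PD": [(0, 1), (1, 2)],
    "1PD": [(0, 1)],
    "IND": [],
}


def design_matrix(pairs):
    """Dummy-coded X: intercept, three main effects, chosen pairwise terms."""
    rows = []
    for cell in CELLS:
        bits = [int(ch) for ch in cell]
        row = [1] + bits + [bits[i] * bits[j] for i, j in pairs]
        rows.append([Fraction(v) for v in row])
    return rows


def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination."""
    size = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def z_and_omega(pairs):
    """First row of (X'X)^{-1} X' and the modification omega = max(-z, 0)."""
    x = design_matrix(pairs)
    cols = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(cols)] for i in range(cols)]
    e1 = [Fraction(1)] + [Fraction(0)] * (cols - 1)
    w = solve(xtx, e1)  # first row of (X'X)^{-1}, using its symmetry
    z = [sum(wi * xi for wi, xi in zip(w, row)) for row in x]
    omega = [max(-v, Fraction(0)) for v in z]
    return z, omega


for name, pairs in MODELS.items():
    z, omega = z_and_omega(pairs)
    print(name, [str(v) for v in z], [str(v) for v in omega])
```

Using `Fraction` keeps values such as 1/3 exact, so the output can be compared with Table 1 entry by entry.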

Finally, in this section the design matrix $X^{LLM}$ is a design matrix that only concerns the design of samples, while it may also contain the design for categorical covariates. If the design of categorical covariates is part of $X^{LLM}$ , the modified $n^{LLM, Chap - k}$ should be calculated for each covariate category separately. For instance, when there are two groups $g_{1}$ and $g_{2}$ , that each have their own log-linear model specification, Equation (18) should be applied to obtain $n_{g_{1}}^{LLM, Chap - k}$ and $n_{g_{2}}^{LLM, Chap - k}$ separately.
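Equations (18) and (19) for a single group of three samples can be sketched with a small Poisson ML fit. The sketch uses iteratively reweighted least squares, a standard way to compute ML estimates for log-linear models. The text does not state which fitting routine was used, so this choice, the dummy coding, the starting values, the function names, and the counts are all assumptions.

```python
from math import exp, log

CELLS = ["111", "110", "101", "011", "100", "010", "001"]


def design_matrix(pairs):
    """Dummy-coded X: intercept, main effects A, B, C, chosen pairwise terms."""
    rows = []
    for cell in CELLS:
        bits = [int(ch) for ch in cell]
        rows.append([1.0] + [float(b) for b in bits]
                    + [float(bits[i] * bits[j]) for i, j in pairs])
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


def fit_m0(counts, pairs, omega, iterations=100, tol=1e-12):
    """Equation (18): y = n + omega; Equation (19): return exp(lambda_hat)."""
    x = design_matrix(pairs)
    y = [n + w for n, w in zip(counts, omega)]
    cols = len(x[0])
    mu = [yi + 0.1 for yi in y]            # starting values away from zero
    eta = [log(m) for m in mu]
    lam = [0.0] * cols
    for _ in range(iterations):
        work = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(r[i] * r[j] * m for r, m in zip(x, mu))
                 for j in range(cols)] for i in range(cols)]
        xtwz = [sum(r[i] * m * v for r, m, v in zip(x, mu, work))
                for i in range(cols)]
        new = solve(xtwx, xtwz)
        change = max(abs(p - q) for p, q in zip(new, lam))
        lam = new
        eta = [sum(l * v for l, v in zip(lam, r)) for r in x]
        mu = [exp(e) for e in eta]
        if change < tol:
            break
    return exp(lam[0])                      # intercept gives m_000


# Hypothetical counts in the order of CELLS.
counts = [4, 7, 5, 3, 30, 20, 25]
omega_1pd = [1 / 3, 0, 1 / 3, 1 / 3, 0, 0, 0]   # Table 1b, column for 1PD
print(fit_m0(counts, [(0, 1)], omega_1pd))
```

For the 1PD model the fitted value can be checked against the closed form in Equation (17). The same function can be called with the IND model (an empty list of pairs) and $ω^{IND}$, for which no closed form is available.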

2.5. Other Bias-Reduced MSE Estimators

Evans and Bonett (1994) and Rivest and Lévesque (2001) propose bias-reduced MSE-estimators for log-linear models with more than two samples. They are, just like $n^{LLM, Chap - k}$ in Equation (18), based on the modified score-function approach. Evans and Bonett (1994) propose a very simple modification scheme that implies adding $0.5^{k - 1}$ to each element in $n$ before estimation. This has the advantage that it can be easily applied with any log-linear model specification. In case of the two sample model in Equation (1), this gives $n^{EB} = (n_{11} + 0.5, n_{10} + 0.5, n_{01} + 0.5)$ , which differs from $n^{Chap}$ and is therefore not optimal, which is also shown in the simulation study presented in Subsection 3.1. Interestingly, for this simple case the modification scheme by Evans and Bonett is equivalent to the modification schemes proposed by Firth (1993), Kosmidis and Firth (2011), and Kosmidis et al. (2020), which are designed to obtain bias-reduced estimators for $λ$ , $λ_{a}$ , and $λ_{b}$ . An optimal modification scheme to reduce bias in ML estimates of $λ$ -parameters is not optimal for ML estimates for $m_{00 \dots 0} = \exp (λ)$ , and therefore the modification scheme of the EB-estimator is probably not optimal either. This is also shown by Rivest and Lévesque (2001), who show in a Monte Carlo experiment that for the set of log-linear models defined by Otis et al. (1978), the modification scheme by Evans and Bonett reduces the bias in the population size estimate for some models but also worsens it for others.

Rivest and Lévesque (2001) propose a more sophisticated MSE modification scheme that is, for the two sample model in Equation (1), equivalent to the Chapman-estimator. In a Monte Carlo simulation study they show that their RL-estimator has less bias than the EB-estimator for all the models by Otis et al. (1978). For more than two samples, the modification scheme by Rivest and Lévesque (2001) differs from the modification scheme proposed in Equation (18). This is clearly shown by the difference in $n^{IND, Chap - 3}$ and $n^{M_{t}, RL}$ , which for three samples concern modified $n$ for the independence model. They are

n^{IND, Chap - 3} = {(n_{111} + 1 / 2, n_{110}, n_{101}, n_{011}, n_{100}, n_{010}, n_{001})}^{⊤},

and (see Rivest and Lévesque 2001, 562)

n^{M_{t}, RL} = {(n_{111}, n_{110} + 1 / 3, n_{101} + 1 / 3, n_{011} + 1 / 3, n_{100} + 1 / 6, n_{010} + 1 / 6, n_{001} + 1 / 6)}^{⊤} .

Rivest and Lévesque also develop a modification scheme for the so-called $M_{th}$ model, which can be compared to the saturated model, but with the restriction $λ_{ab}^{AB} = λ_{ac}^{AC} = λ_{bc}^{BC}$ . For three sources, the modification scheme that belongs to this $M_{th}$ model becomes

n^{M_{th}, RL} = {(n_{111}, n_{110} + 2 / 3, n_{101} + 2 / 3, n_{011} + 2 / 3, n_{100}, n_{010}, n_{001})}^{⊤} .

For other unsaturated models, such as the 2PD and 1PD model, Rivest and Lévesque do not specify a specific modification scheme, therefore in the simulation studies in the next section, in such cases we use the modification scheme for model $M_{th}$ .
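For the three-sample independence model, the three modification schemes discussed in this section can be placed side by side in a small sketch. The added amounts are taken from the expressions above, with $0.5^{k - 1} = 1/4$ for the EB scheme. The dictionary layout, the function name, and the counts are hypothetical.

```python
from fractions import Fraction

CELLS = ["111", "110", "101", "011", "100", "010", "001"]

# Amount added to each observed cell, independence model, three samples.
RL_BY_INCLUSIONS = {3: Fraction(0), 2: Fraction(1, 3), 1: Fraction(1, 6)}
SCHEMES = {
    "Chap": {c: Fraction(1, 2) if c == "111" else Fraction(0) for c in CELLS},
    "RL": {c: RL_BY_INCLUSIONS[c.count("1")] for c in CELLS},
    "EB": {c: Fraction(1, 2) ** (3 - 1) for c in CELLS},  # 0.5^(k-1), k = 3
}


def modify(counts, scheme):
    """Return the modified counts n + (added amount) for the chosen scheme."""
    return {c: counts[c] + SCHEMES[scheme][c] for c in CELLS}


# Hypothetical counts, chosen only for illustration.
counts = {"111": 4, "110": 7, "101": 5, "011": 3, "100": 30, "010": 20, "001": 25}
for name in SCHEMES:
    print(name, {c: str(v) for c, v in modify(counts, name).items()})
```

The schemes differ both in which cells are modified and in the total amount added: 1/2 for the Generalized Chapman scheme, 3/2 for the RL scheme, and 7/4 for the EB scheme.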

3. Simulation Studies

In this section the Generalized Chapman-estimator is compared with competing estimators in three Monte Carlo simulation studies, each of which considers a set of different scenarios. The scenarios differ from each other by $N$ , $p_{a}, \dots, p_{k}$ and different pairwise interaction terms that are defined by odds ratios denoted as $θ^{AB}$ , $θ^{AC}$ , etc. The method to generate the samples is developed and further explained in Hammond et al. (2024). This method allows the generation of $n_{ab}$ , $n_{abc}$ , or $n_{abcd}$ from a multinomial distribution that fit a log-linear model that has prespecified inclusion probabilities and odds ratio(s). In each simulation study the ML-estimator is compared with a set of bias-corrected estimators that were discussed in the previous sections, including the Chapman- or the Generalized Chapman-estimator.

The first simulation study is straightforward and only considers two samples and seven scenarios. The second and third simulation study both consider three and four samples in fifteen scenarios. The difference between the second and the third simulation study is the log-linear model that is used for estimation. In the second study the population size estimates are obtained from an assumed saturated log-linear model as in Equation (3), and in the third study the population size estimates are obtained under the assumption of the correct log-linear model specification (see Table 4). The use of the correct model specification may not be realistic in practical applications, but the results of this study show how the Generalized Chapman-estimator performs in case of other unsaturated models.

We should note that when a model contains redundant parameters as occurs with the saturated model in the second study, the resulting estimates have a larger variance (Bishop et al. 1975, 242). This additional variance does in itself not lead to biased estimates, but it does inflate bias. This inflationary effect can be seen when the bias is written as $(\sum_{r = 1}^{R} {\hat{m}}_{000, r} + n_{r}) / R - N = (\sum_{r = 1}^{R} \exp ({\hat{λ}}_{r})) / R - \exp (λ)$ , with $r = 1, \dots, R$ and R as the number of replications. When there is some positive bias in the estimate for $λ$ , a larger variance in $\hat{λ}$ leads to a further increase of $(\sum_{r = 1}^{R} \exp ({\hat{λ}}_{r})) / R$ and therefore inflates the bias.
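The inflation can be illustrated under an additional assumption that is not made in the text, namely that $\hat{λ}$ is approximately normally distributed with mean $μ$ and variance $σ^{2}$ . The mean of a lognormal variable then gives

```latex
E [\exp (\hat{λ})] = \exp (μ + σ^{2} / 2), \qquad
E [\exp (\hat{λ})] - \exp (λ) = \exp (λ) \left( \exp (μ - λ + σ^{2} / 2) - 1 \right) .
```

For a fixed $μ \geq λ$ the right-hand side increases with $σ^{2}$ , so under this assumption a larger variance in $\hat{λ}$ enlarges the bias on the scale of $m_{000}$ .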

Another minor but important simulation issue is what Otis et al. (1978, 125) refer to as “failures.” A simple example of a failure is the LP-estimator with $n_{11} = 0$ , which leads to ${\hat{N}}^{LP} = \infty$ . Otis et al. (1978) recommend replacing such a replication with a new replication or ignoring it, advice that was followed in Evans and Bonett (1994) and Rivest and Lévesque (2001). However, only replacing or ignoring failures that correspond to relatively large population size estimates introduces selection bias in the sense that, when ${\hat{N}}^{est}$ is an unbiased estimator for $N$ , the mean over these estimates ${\bar{N}}^{est}$ departs from $N$ . Therefore, to obtain accurate mean estimates that allow a fair comparison of bias between the different estimators, we set the parameters of each scenario such that the probability of failures is close to zero. This was tested by checking if there were estimates that are larger than 100 times their true $N$ . If this was the case, this estimate was defined as a failure and replaced by the EB estimate for the same replication, which is also relatively large but cannot be a failure. If this occurred, it is indicated by a $†_{(i)}$ in the superscript where $i$ is the number of failures.

3.1. Simulation Study with Two Samples

This section presents the results of a simulation study that compares the LP, Bailey, EB, RL, and Chapman estimator in case of two samples. The simulation settings for $N$ , $p_{A}$ , and $p_{B}$ in each scenario are shown in Table 2.

Table 2.

Seven Scenarios for a Two Sample Simulation Study.

$S$	$N$	$p_{A}$	$p_{B}$
1	100	0.5	0.2
2	100	0.35	0.3
3	500	0.4	0.15
4	500	0.25	0.2
5	10,000	0.3	0.1
6	10,000	0.25	0.15
7	100	0.15	0.15

According to the expected bias expression in Equation (6), the bias in the Chapman-estimator should be small in scenarios 1 to 6. To illustrate how the different estimators are affected when the expected bias of the Chapman-estimator is substantial, we added a seventh scenario for which this should hold.
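The right-hand side of Equation (6) can be evaluated for the scenarios in Table 2 with a short sketch. It assumes that $E [n] = N (1 - (1 - p_{a}) (1 - p_{b}))$ , which holds when the two samples are independent; the function and variable names are hypothetical.

```python
def chapman_bias_approx(N, p_a, p_b):
    """Right-hand side of Equation (6), with E[n] = N * (1 - (1 - p_a)(1 - p_b))."""
    q = (1 - p_a) * (1 - p_b)        # probability of being missed by both samples
    expected_n = N * (1 - q)
    ratio = p_a * p_b / (1 - q)      # share of observed units that are in both samples
    return expected_n * q / (1 - q) * (1 - ratio) ** (expected_n - 1)


# (N, p_A, p_B) for scenarios 1 to 7 in Table 2.
SCENARIOS = [
    (100, 0.5, 0.2), (100, 0.35, 0.3), (500, 0.4, 0.15), (500, 0.25, 0.2),
    (10000, 0.3, 0.1), (10000, 0.25, 0.15), (100, 0.15, 0.15),
]

for s, (N, p_a, p_b) in enumerate(SCENARIOS, start=1):
    print(s, round(chapman_bias_approx(N, p_a, p_b), 4))
```

Under this assumption the expression is far below 0.01 for scenarios 1 to 6 and roughly 7.5 for scenario 7.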

Table 3 shows the results for the mean estimates $\bar{N}$ and standard errors (SEs) of the different estimators. In this simple two sample setting the RL- and Chapman-estimator are equivalent and are therefore shown in one column. The asterisks in the $\bar{N}$ column of the Chapman- and RL-estimator show that, at significance level $0.05$ , the hypothesis $N = {\hat{N}}^{Chapman / RL}$ cannot be rejected in five out of the six regular scenarios. At significance level $0.01$ this holds for all six regular scenarios. The same does not hold for the other estimators, for which the mean over all replications differs significantly from $N$ at level $0.001$ in most cases, and at level $0.05$ in all cases. In almost all cases the bias and the SEs of the Chapman- and RL-estimator are smaller than the bias and SEs of the other estimators, which shows that with two samples the Chapman- and RL-estimator are superior to the other estimators. If expected bias in the Chapman-estimator becomes substantial, as in scenario 7, all estimators are considerably biased. Note here that the absolute bias in the Chapman-estimator is 7.7, which is close to the expected bias of 7.5 that results from Equation (6).

Table 3.

Simulation Study with $20,000$ Replications for the Seven Scenarios in Table 2.

		Maximum likelihood		Bailey		Evans & Bonett		Chapman/Rivest & Lévesque
$S$	$\bar{n}$	$\bar{N}$	$SE$	$\bar{N}$	$SE$	$\bar{N}$	$SE$	$\bar{N}$	$SE$
1	60.0	105.3^*** $†_{(1)}$	27.8	96.1^***	20.8	105.2^***	25.8	100.1	21.9
2	54.5	106.0^***	28.7	98.0^***	22.2	105.3^***	26.3	100.4^*	23.0
3	244.9	508.3^***	70.2	493.6^***	65.6	507.4^***	68.9	499.2	66.5
4	200.1	512.4^***	85.7	495.4^***	78.9	509.3^***	83.3	499.4	79.7
5	3,699.2	10,018.0^***	460.9	9,987.9^***	457.9	10,013.1^***	459.9	9,996.9	458.4
6	3,624.9	10,016.7^***	411.3	9,993.9^*	409.3	10,012.5^***	410.7	9,999.6	409.6
7	27.8	142.7^*** $†_{(2065)}$	109.6	87.2^***	45.8	128.0^***	104.2	92.3^***	48.8

Note. $\bar{n}$ gives the mean number of observed units $n$ over all replications. The superscripts ^*, ^**, and ^*** indicate that we can reject ${\hat{N}}^{est} = N$ with a two-sided t-test with p-values = $0.05, 0.01$ , and $0.001$ respectively. A $†$ in the superscript indicates that failures, which are defined as estimates that are more than 100 times larger than $N$ , were replaced with the EB estimate for that replication.
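The significance markers in Table 3 can be approximately reproduced from the reported means and SEs. The Python sketch below makes two assumptions that are not stated in the note: the SE column is read as the standard deviation of the estimates over the replications, and with $20,000$ replications the t-distribution is replaced by the standard normal. The function name is hypothetical.

```python
import math


def significance_stars(mean_est, sd_est, reps, true_n):
    """Stars for a two-sided test of H0: E[estimate] = true_n."""
    t = (mean_est - true_n) / (sd_est / math.sqrt(reps))
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal p-value
    for level, stars in ((0.001, "***"), (0.01, "**"), (0.05, "*")):
        if p < level:
            return stars
    return ""


# Chapman/RL column of Table 3, scenarios 1, 2 and 7 (N = 100).
for mean_est, sd_est in ((100.1, 21.9), (100.4, 23.0), (92.3, 48.8)):
    print(mean_est, repr(significance_stars(mean_est, sd_est, 20000, 100)))
```

Because the table reports rounded means and SEs, a marker that sits very close to a threshold could differ from the one obtained with the unrounded simulation output.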

3.2. Simulation Studies with Three and Four Samples

This section studies which estimator should be preferred in case of three or four samples, for which the RL-estimator and Generalized Chapman-estimator are no longer equivalent, unlike in the previous section. This section contains two simulation studies. The first simulation study uses the saturated log-linear model for estimation and the second simulation study uses the correct unsaturated log-linear model for estimation. Each simulation study considers the same fifteen scenarios, indicated by $S = 1, \dots, 15$ , which are presented in Table 4. Scenarios differ with respect to the size of the population $N$ , the number of sources $k$ , and the log-linear model specifications (i.e., different values for $p_{A}$ , $p_{B}$ , $p_{C}$ , $p_{D}$ , $θ_{AB}$ , $θ_{AC}$ , $θ_{BC}$ , $θ_{AD}$ , $θ_{BD}$ , and $θ_{CD}$ , see Hammond et al. 2024, for further details). ${LLM}^{S}$ denotes the log-linear model that is used to generate $n$ . The parameters are chosen such that ${LLM}^{S}$ = IND for $S = 1, 2, 3, 13, 14$ , ${LLM}^{S}$ = 1PD for $S = 4, 5, 6$ , ${LLM}^{S}$ = 2PD for $S = 7, 8, 9$ , ${LLM}^{S}$ = SAT for $S = 10, 11, 12$ and ${LLM}^{S}$ = 4PD for $S = 15$ , which is a log-linear model with four samples and four pairs of dependent sources. Finally, scenario $12$ is set such that $λ_{ab}^{AB} = λ_{ac}^{AC} = λ_{bc}^{BC}$ , which, as discussed in Section 2.5, corresponds to the model $M_{th}$ that underlies the RL-estimator. For this estimator the counts $n_{1000}$ , $n_{0100}$ , $n_{0010}$ , $n_{0001}$ , $n_{1100}$ , $n_{1010}$ , $n_{1001}$ , $n_{0110}$ , $n_{0101}$ , and $n_{0011}$ are replaced by the modified $n_{1000} + 1 / 4$ , $n_{0100} + 1 / 4$ , $n_{0010} + 1 / 4$ , $n_{0001} + 1 / 4$ , $n_{1100} + 1 / 6$ , $n_{1010} + 1 / 6$ , $n_{1001} + 1 / 6$ , $n_{0110} + 1 / 6$ , $n_{0101} + 1 / 6$ , and $n_{0011} + 1 / 6$ .

Table 4.

Three and Four Sample Simulation Scenarios.

$S$	$N$	$k$	$p_{A}$	$p_{B}$	$p_{C}$	$p_{D}$	$θ_{AB}$	$θ_{AC}$	$θ_{BC}$	$θ_{AD}$	$θ_{BD}$	$θ_{CD}$	${LLM}^{S}$
1	100	3	0.5	0.4	0.3		1	1	1				IND
2	500	3	0.4	0.3	0.2		1	1	1				IND
3	10,000	3	0.35	0.3	0.25		1	1	1				IND
4	100	3	0.5	0.4	0.3		1.5	1	1				1PD
5	500	3	0.4	0.3	0.2		1.5	1	1				1PD
6	10,000	3	0.35	0.3	0.25		1.5	1	1				1PD
7	100	3	0.5	0.4	0.3		1.5	1	0.5				2PD
8	500	3	0.4	0.3	0.2		1.5	1	0.5				2PD
9	10,000	3	0.35	0.3	0.25		1.5	1	0.5				2PD
10	100	3	0.5	0.4	0.3		1.5	0.75	0.5				SAT
11	500	3	0.4	0.3	0.2		1.5	0.75	0.5				SAT
12	10,000	3	0.3	0.3	0.3		1.25	1.25	1.25				SAT
13	1,000	4	0.35	0.3	0.25	0.2	1	1	1	1	1	1	IND^a
14	20,000	4	0.25	0.2	0.15	0.1	1	1	1	1	1	1	IND^a
15	20,000	4	0.25	0.2	0.15	0.1	1.5	1	0.75	1.5	1	0.5	4PD^a

^a The three-way interaction parameters $θ_{ABC}$ , $θ_{ACD}$ , and $θ_{BCD}$ are set to $1$ .

3.2.1. Results for the Saturated Log-Linear Model

The results presented in Table 5 show that, for an assumed saturated model, the Generalized Chapman-estimator performs best of the tested estimators in each scenario. For $p = 0.01$ , the hypothesis that its mean estimate equals $N$ cannot be rejected in fourteen out of fifteen scenarios. Also, in the scenarios $5$ , $7$ , $10$ , and $13$ , where the Generalized Chapman-estimator shows some statistically significant bias for $p = 0.001$ , the bias is small in itself and much smaller than in the other estimators. For the IND and 1PD model with large $N$ , the bias in the ML and EB estimates is large. As expected, the RL-estimator performs clearly better than the EB-estimator, but it still shows some statistically significant bias for most scenarios, especially for scenario $13$ or when $N = 100$ or $500$ . Interestingly, even for scenario $12$ , which matches the $M_{th}$ model, the Generalized Chapman-estimator slightly outperforms the RL-estimator, both in mean bias and SE.

Table 5.

Simulation Study with Assumed Saturated Log-Linear Models, with $100,000$ Replications for Scenarios $1 - 15$ in Table 4.

		Maximum likelihood		Evans & Bonett		Rivest & Lévesque		Generalized Chapman
$S$	$\bar{n}$	$\bar{N}$	SE	$\bar{N}$	SE	$\bar{N}$	SE	$\bar{N}$	SE
1	79.0	113.2^***	61.7	112.3^***	58.8	103.3^***	38.2	100.0	24.1
2	332.0	521.8^***	104.4	522.1^***	102.9	507.0^***	94.4	500.2	89.9
3	6,587.4	10,016.0^***	363.5	10,016.8^***	363.4	10,004.6^***	362.2	9,999.0	361.5
4	77.3	116.9^***	80.3	115.2^***	77.0	104.0^***	46.8	99.9	26.4
5	323.8	525.3^***	111.7	524.5^***	109.5	508.4^***	99.7	500.8^***	94.5
6	6,439.5	10,017.8^***	373.7	10,018.0^***	373.6	10,005.6^***	372.2	9,999.4	371.5
7	79.1	120.6^***	90.6	118.7^***	88.1	104.0^***	44.2	99.6^***	24.9
8	330.5	532.2^***	156.5	530.8^***	150.3	510.0^***	127.9	500.3	106.2
9	6,608.9	10,020.9^***	391.4	10,021.6^***	391.2	10,007.3^***	389.6	10,000.6	388.6
10	80.0	118.2^***	83.6	116.6^***	80.8	103.8^***	43.9	99.7^***	24.1
11	334.1	529.5^***	138.3	529.4^***	133.4	508.9^***	112.4	499.9	104.0
12	6,355.3	10,016.7^***	365.5	10,016.7^***	365.3	10,005.2^***	364.2	9,999.4	363.5
13	727.0	1,205.5^*** $†_{(498)}$	1,281.9	1,196.4^***	1,270.4	1,183.8^*** $†_{(498)}$	1,272.0	996.8^***	286.9
14	10,819.7	20,904.5^***	4,328.8	20,859.1^***	4,272.2	20,846.6^***	4,303.4	20,005.8	3,730.5
15	10,677.5	21,333.1^***	7,684.4	21,281.8^***	7,578.6	21,255.0^***	7,650.6	20,033.2^**	4,724.0

Note. $\bar{n}$ gives the mean number of observed units $n$ over all replications. The superscripts ^*, ^**, and ^*** indicate that we can reject ${\hat{N}}^{est} = N$ with a two-sided t-test with p-values = $0.05, 0.01$ , and $0.001$ respectively. A $†$ in the superscript indicates that failures, which are defined as estimates that are more than 100 times larger than $N$ , were replaced with the EB estimate for that replication.

The Generalized Chapman-estimator not only outperforms the other estimators in terms of smaller mean bias, also the SEs are substantially smaller. This holds especially for scenarios with smaller $N$ , but also for scenarios with four samples and large $N$ , irrespective of the model specification that was used to generate the contingency tables.
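For a saturated model the estimate of the unobserved cell has a closed form, which makes it easy to see how a modified input vector changes the result. The Python sketch below assumes the standard result for a saturated model without the highest-order interaction: the unobserved count equals the product of the cells with an odd number of ones divided by the product of the cells with an even number of ones. It also assumes that the estimate of $N$ is the unmodified observed total plus that count. The function name and the example counts are hypothetical, and the only modification shown is the two-sample Chapman case ( $n_{11} + 1$ ).

```python
def saturated_estimate(counts, modification=None):
    """Population size estimate under a saturated model with k samples.

    counts maps inclusion patterns such as "101" to observed frequencies
    (the all-zero pattern is excluded); modification maps patterns to the
    amounts added before estimation.
    """
    modification = modification or {}
    odd = even = 1.0
    for pattern, count in counts.items():
        value = count + modification.get(pattern, 0.0)
        if pattern.count("1") % 2 == 1:
            odd *= value
        else:
            even *= value
    if even == 0:
        return float("inf")  # a "failure" in the sense of Otis et al. (1978)
    return sum(counts.values()) + odd / even


two_samples = {"11": 10, "10": 40, "01": 10}
print(saturated_estimate(two_samples))             # LP-type estimate
print(saturated_estimate(two_samples, {"11": 1}))  # Chapman-type estimate
```

In the two-sample case the second call agrees with $(n_{A} + 1)(n_{B} + 1) / (n_{11} + 1) - 1$ , because that expression can be rewritten as $n + n_{10} n_{01} / (n_{11} + 1)$ .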

3.2.2. Results for Unsaturated Log-Linear Models

Table 6 shows the results of a simulation study that uses the same scenarios as in Table 4, but now the estimates are based on the correct unsaturated log-linear models. The results of scenarios $10 - 12$ are not shown in Table 6 because they are already provided in Table 5, as for these scenarios the saturated log-linear model is also the true model.

Table 6.

Simulation Study with Assumed Correctly Specified Log-Linear Models, with $100,000$ Replications, for Scenarios $1 - 9$ , $13 - 15$ in Table 4.

		Maximum likelihood		Evans & Bonett		Rivest & Lévesque		Generalized Chapman
$S$	$\bar{n}$	$\bar{N}$	SE	$\bar{N}$	SE	$\bar{N}$	SE	$\bar{N}$	SE
1	79.0	100.5^***	8.1	100.7^***	8.0	100.7^***	8.0	99.9^***	7.9
2	332.0	501.5^***	28.6	501.1^***	28.4	501.3^***	28.4	499.9	28.3
3	6,587.4	10,001.3^***	126.1	10,000.8^**	126.0	10,001.0^***	126.1	9,999.7	126.0
4	77.3	101.2^***	11.6	101.2^***	11.3	99.9^***	10.8	100.0	10.9
5	323.8	503.6^***	41.4	502.7^***	40.9	499.8^*	40.2	500.3^**	40.4
6	6,439.5	10,003.1^***	164.4	10,002.5^***	164.3	10,000.1	164.2	10,000.4	164.2
7	79.1	102.8^***	15.8	102.8^***	15.1	100.8^***	13.1	100.0	12.3
8	330.5	506.4^***	48.6	506.2^***	48.2	502.3^***	46.8	500.3^**	46.0
9	6,608.9	10,005.1^***	192.7	10,005.0^***	192.6	10,001.6^***	192.4	9,999.8	192.2
13	727.0	1,000.7^***	30.1	1,000.0	29.9	1,001.6^***	30.1	999.5^***	29.9
14	10,819.7	20,001.7^***	255.6	19,997.3^***	255.4	20,002.0^**	255.5	19,996.9^***	255.4
15	10,677.5	20,011.6^***	414.4	20,010.2^***	414.3	20,005.2^***	413.9	20,000.1	413.6

Note. $\bar{n}$ gives the mean number of observed units $n$ over all replications. The superscripts ^*, ^**, and ^*** indicate that we can reject ${\hat{N}}^{est} = N$ with a two-sided t-test with p-values = $0.05, 0.01$ , and $0.001$ respectively.

Modification schemes for unsaturated models with four samples, as in the last three scenarios, were not presented in Table 1 or Subsection 2.5, so for the IND and 4PD model they are given here. For the Generalized Chapman estimator and the IND model, $n_{1111}$ , $n_{1110}$ , $n_{1101}$ , $n_{1011}$ , and $n_{0111}$ are replaced by $n_{1111} + 3 / 11$ , $n_{1110} + 1 / 11$ , $n_{1101} + 1 / 11$ , $n_{1011} + 1 / 11$ , and $n_{0111} + 1 / 11$ respectively, and for the 4PD model, $n_{1110}$ , $n_{1101}$ , $n_{1011}$ , $n_{0111}$ , $n_{1010}$ , and $n_{0101}$ are replaced by $n_{1110} + 1 / 7$ , $n_{1101} + 1 / 7$ , $n_{1011} + 1 / 7$ , $n_{0111} + 1 / 7$ , $n_{1010} + 3 / 7$ , and $n_{0101} + 3 / 7$ respectively. For the RL-estimator and the IND model, $n_{1000}$ , $n_{0100}$ , $n_{0010}$ , $n_{0001}$ , $n_{1100}$ , $n_{1010}$ , $n_{1001}$ , $n_{0110}$ , $n_{0101}$ , and $n_{0011}$ are replaced by $n_{1000} + 1 / 8$ , $n_{0100} + 1 / 8$ , $n_{0010} + 1 / 8$ , $n_{0001} + 1 / 8$ , $n_{1100} + 1 / 3$ , $n_{1010} + 1 / 3$ , $n_{1001} + 1 / 3$ , $n_{0110} + 1 / 3$ , $n_{0101} + 1 / 3$ , and $n_{0011} + 1 / 3$ , and for the 4PD model the modification scheme is the same as for the $M_{th}$ model, as given in Subsection 3.2.
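The four-sample modification schemes listed above can be written down compactly as mappings from inclusion patterns to added amounts. The Python sketch below only encodes the numbers given in this paragraph; the helper names `patterns` and `apply_scheme`, the constant names, and the example counts are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product


def patterns(k, ones):
    """All inclusion patterns for k samples with the given number of ones."""
    return ["".join(bits) for bits in product("10", repeat=k)
            if bits.count("1") == ones]


# Generalized Chapman, IND model: n1111 + 3/11 and each triple + 1/11.
GC_IND = {"1111": Fraction(3, 11),
          **{p: Fraction(1, 11) for p in patterns(4, 3)}}
# Generalized Chapman, 4PD model: each triple + 1/7, n1010 and n0101 + 3/7.
GC_4PD = {**{p: Fraction(1, 7) for p in patterns(4, 3)},
          "1010": Fraction(3, 7), "0101": Fraction(3, 7)}
# RL-estimator, IND model: each single + 1/8 and each pair + 1/3.
RL_IND = {**{p: Fraction(1, 8) for p in patterns(4, 1)},
          **{p: Fraction(1, 3) for p in patterns(4, 2)}}


def apply_scheme(counts, scheme):
    """Return the modified count vector; cells not in the scheme are unchanged."""
    return {p: c + scheme.get(p, 0) for p, c in counts.items()}


print(apply_scheme({"1111": 2, "1110": 7, "1000": 40}, GC_IND))
```

The modified vector would then replace the observed counts as input to the log-linear fit, while the model itself stays the same.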

The results in Table 6 are, for all estimators, clearly better than those in Table 5. The use of the correct unsaturated model for each scenario has substantially reduced the bias and SEs of all estimators. However, these improvements did not affect the (in)significance of the bias. This is shown by the many *s that are still present in all columns except for the column of the Generalized Chapman-estimator, which contains relatively few. Also, although the SEs of each estimator became much smaller and more alike, the SEs of the Generalized Chapman-estimator are in most scenarios still the smallest. Furthermore, in scenarios where the Generalized Chapman-estimator shows some statistically significant bias, it is in most cases smaller than the bias of other estimators, which to some extent also holds for the RL-estimator. This shows that, also for unsaturated models, the Generalized Chapman-estimator performs better than the other estimators.

Particularly interesting is the performance and comparison of the Generalized Chapman-estimator with the RL-estimator in the scenarios $1 - 3$ and $13 - 14$ , because for these scenarios the Generalized Chapman-estimator is not supported by a closed form expression derivation and it can be directly compared with the RL-estimator, as discussed in Section 2.5. For scenarios $3$ and $13 - 14$ , with relatively large $N$ , there is not much difference, as both estimators show small but statistically significant bias. However, for scenarios $1$ and $2$ , which concern relatively small $N$ , the Generalized Chapman-estimator shows very little bias, and clearly less than the RL-estimator. In addition, for these scenarios the SEs of the Generalized Chapman-estimator are slightly but consistently smaller than those of the RL-estimator. Finally, in the 4PD model scenario $15$ , for which the Generalized Chapman-estimator is also not supported by a closed form expression derivation, the Generalized Chapman-estimator outperforms the other estimators, mainly in terms of bias. Together these results provide further support for the modification scheme in Equation (18).

A comparison of the SEs in Tables 5 and 6 shows that the Generalized Chapman-estimator suffers less from overspecification, which occurs for most scenarios when the saturated model is assumed. For example, scenario $7$ in Tables 5 and 6 shows that, when the saturated model is assumed instead of the correctly specified 2PD model, the SE of the Generalized Chapman-estimator increases from $12.3$ to $24.9$ . The other bias-reduced estimators show a much larger increase in SE: the SEs of the RL- and EB-estimator increase from $13.1$ to $44.2$ and from $15.1$ to $88.1$ respectively. Similar increases can be observed in other scenarios.
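Expressed as ratios, the scenario $7$ figures from Tables 5 and 6 read as follows. The labels $SE_{SAT}$ and $SE_{2PD}$ are introduced here only for this illustration, and the ratios are rounded to one decimal:

$$
\frac{SE_{SAT}^{GC}}{SE_{2PD}^{GC}} = \frac{24.9}{12.3} \approx 2.0, \qquad
\frac{SE_{SAT}^{RL}}{SE_{2PD}^{RL}} = \frac{44.2}{13.1} \approx 3.4, \qquad
\frac{SE_{SAT}^{EB}}{SE_{2PD}^{EB}} = \frac{88.1}{15.1} \approx 5.8 .
$$

In this scenario, overspecification roughly doubles the SE of the Generalized Chapman-estimator, whereas it more than triples the SE of the RL-estimator and increases the SE of the EB-estimator almost sixfold.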

4. Example: Number of Homeless People in the Netherlands

A population size estimate of the homeless people in the Netherlands is published annually by Statistics Netherlands. This estimate is an ML estimate based on an MSE model that is discussed in detail in Coumans et al. (2017). The estimate is based on a log-linear model that contains three sources and several (categorical) covariates: gender ( $g$ , $2$ categories), age ( $a$ , $3$ categories), place of living, in- or outside one of the big four Dutch cities ( $p$ , $2$ categories), and region of origin ( $r$ , $3$ categories). Together these define $2 \times 3 \times 2 \times 3 = 36$ subgroups, with observed frequencies denoted as $n_{gapr}$ and observed frequencies for a specific inclusion pattern denoted as $n_{abc, gapr}$ . Which sources, covariates, and interactions between them are included in the log-linear model is the result of an Akaike information criterion (AIC) based model selection procedure that is explained in Coumans et al. (2017). Recent work by Silverman (2020) suggests that Bayesian model selection approaches could lead to more robust and stable results, but this is beyond the scope of this paper.

In this practical example, for the years 2009 to 2018, 2020, and 2021, we replicate the model selection and estimation procedure as explained in Coumans et al. (2017). Data for 2019 are unavailable. The model selection mechanism is performed independently for each year, and so it may select a different log-linear model for each year, with different sample and covariate dependencies. However, this model selection mechanism is based on the unmodified $n$ , and so within each year the same log-linear model is selected and used for each of the different estimates. This allows us to calculate the difference between the ML and Generalized Chapman estimates, all other factors held constant, in a practical example. We also tested the impact of first modifying $n$ and then applying the model selection procedure, but in this case this had little impact. The estimation procedure is repeated for each year, which gives an annual time series of Generalized Chapman, ML, EB, and RL estimates for the population size of homeless people in The Netherlands. Note that for the RL-estimator, the modification scheme for the $M_{th}$ model, as given in Subsection 2.5, is used. Figure 1a to c shows these estimates for the total number of homeless people, the total number of homeless men, and the total number of homeless women, including their two-sided 95% confidence intervals. These confidence intervals are shown by their upper bound (UB) and lower bound (LB), which are based on the parametric bootstrap (compare van der Heijden et al. 2012). Note that each figure has its own scale on the y-axis.

Figure 1.

(a) Total number of homeless people, (b) homeless men, and (c) homeless women in the Netherlands over the period 2009 to 2018 and 2020 to 2021.

In each figure the EB and RL estimates (grey dotted) are always larger than the Generalized Chapman estimates (black solid) and always smaller than the ML estimates (grey solid). The ML estimates are between a minimum of 9.5% (2021) and a maximum of 25.5% (2018) larger than the Generalized Chapman estimates. The confidence interval of the Generalized Chapman estimates is also clearly narrower. Figure 1b and c show that the total annual difference between both estimators, as was observed in Figure 1a, is not proportionally divided over men and women. In fact, the Generalized Chapman-estimator has, relatively, a much larger impact on the estimate of the number of homeless women, which is the smaller group. For women the difference between the estimates is between a minimum of 19.5% in 2017 and a maximum of 51.2% in 2018.

In this practical application the impact of using the Generalized Chapman-estimator instead of the ML-estimator is larger than the impact we have seen in the simulation studies. The reason for this difference is twofold. First, the scenarios in the simulation studies were set such that the probability of estimation failures was very small, which led to a mean coverage (i.e., $\bar{n} / N$ ) that was large compared to the coverage in our example of homeless people. Second, the MSE model to estimate the number of homeless people involves the use of (categorical) covariates to control for heterogeneity in inclusion probabilities. Because for some homeless people their background characteristics are missing, the estimation procedure uses an expectation–maximization (EM) algorithm to impute missing data (see Coumans et al. 2017, for further details), which for some inclusion patterns may lead to observed frequencies between zero and one. To see why this is important we zoom in on the underlying subgroup estimates for men and women in the year 2021 presented in Table 7 below.

Table 7.

Estimated Number of Homeless People in The Netherlands in 2021, Separated by Men and Women and $18$ Subgroups Based on Age, Living In- or Outside One of the Four Big Dutch Cities and Country of Origin.

$G_{apr}$	Men					Women
	$n_{Mapr}$	$n_{101, Mapr}$	${\hat{N}}_{Mapr}^{2 PD, ML}$	${\hat{N}}_{Mapr}^{2 PD, Chap - 3}$	$Δ_{Mapr}$	$n_{Wapr}$	$n_{101, Wapr}$	${\hat{N}}_{Wapr}^{2 PD, ML}$	${\hat{N}}_{Wapr}^{2 PD, Chap - 3}$	$Δ_{Wapr}$
1	1,956	134.07	4,279	4,263	−16	388	8.10	787	678	−109
2	1,283	45.78	4,687	4,464	−223	211	4.03	993	750	−243
3	1,130	37.41	4,458	4,304	−154	164	2.56	760	582	−178
4	516	17.62	1,006	978	−28	97	1.52	170	147	−23
5	496	9.56	2,241	2,065	−176	76	0.90	333	245	−88
6	491	41.02	1,316	1,278	−38	123	3.65	325	264	−61
7	436	36.36	1,072	1,055	−17	102	2.82	243	202	−41
8	350	12.83	1,388	1,302	−86	52	1.11	279	204	−75
9	319	11.04	1,224	1,133	−91	57	1.24	314	226	−88
10	241	7.72	555	533	−22	45	0.66	92	77	−15
11	237	6.07	1,222	989	−233	47	0.63	311	198	−113
12	224	4.84	952	890	−62	35	0.46	142	107	−35
13	201	11.23	685	586	−99	55	1.02	181	130	−51
14	106	2.71	329	274	−55	25	0.29	66	48	−18
15	95	7.82	287	275	−12	28	0.90	89	70	−19
16	91	1.44	561	435	−126	17	0.17	104	65	−39
17	46	1.15	252	194	−58	9	0.14	72	45	−27
18	35	1.90	150	120	−30	11	0.24	50	34	−16
Total	8,253	390.57	26,664	25,138	−1,526	1,542	30.44	5,311	4,072	−1,239

Table 7 presents $18$ subgroups indicated by $G_{apr}$ for both men and women. For each subgroup we show both the total observed count $n_{gapr}$ and the observed count $n_{101, gapr}$ for inclusion pattern $101$ . This specific inclusion pattern is shown because the selected log-linear model is a 2-pair dependence model, for which Table 1 tells us that $n_{101}^{Chap} = n_{101} + 1$ is the only modified observed frequency, while the other elements in $n_{abc}^{Chap}$ are equal to $n_{abc}$ . The difference between $n_{101, gapr}$ and $n_{101, gapr}^{Chap}$ should therefore explain the difference between ${\hat{N}}_{gapr}^{ML}$ and ${\hat{N}}_{gapr}^{Chap}$ . This difference is shown in the columns $Δ_{Mapr} = {\hat{N}}_{Mapr}^{2 PD, Chap - 3} - {\hat{N}}_{Mapr}^{2 PD, ML}$ and $Δ_{Wapr} = {\hat{N}}_{Wapr}^{2 PD, Chap - 3} - {\hat{N}}_{Wapr}^{2 PD, ML}$ .
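The totals in the last row of Table 7 can be used to check the $Δ$ columns and to express the 2021 differences as percentages. The short Python sketch below uses only the reported totals; the function and variable names are hypothetical.

```python
def pct_larger(ml, chapman):
    """Percentage by which the ML estimate exceeds the Chapman-type estimate."""
    return 100.0 * (ml - chapman) / chapman


# Totals for 2021 as reported in the last row of Table 7.
ml = {"men": 26664, "women": 5311}
gc = {"men": 25138, "women": 4072}

delta = {group: gc[group] - ml[group] for group in ml}  # Chapman minus ML
total = pct_larger(sum(ml.values()), sum(gc.values()))

print(delta)
print(round(total, 1), round(pct_larger(ml["women"], gc["women"]), 1))
```

The total for 2021 comes out at about 9.5%, which is the minimum reported for Figure 1a, and the figure for women (about 30.4%) lies within the range reported for Figure 1c.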

When we compare the columns $Δ_{Mapr}$ and $Δ_{Wapr}$ in Table 7, we see that, although the observed counts of men are much larger than those of women, the differences between the two estimates are of similar size for the subgroups of men and women. This can be explained by the smaller observed frequencies for women with inclusion pattern $101$ , which are sometimes even between zero and one, as can be seen in the columns of $n_{101, Mapr}$ and $n_{101, Wapr}$ . Adding $1$ to such a small number has a relatively large impact on the population size estimate.
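As a rough illustration of this mechanism, consider a three-sample 2PD model without covariates in which samples $A$ and $C$ are conditionally independent given $B$ . Under that simplifying assumption, which is not the covariate model actually used here, the closed-form ML estimate of the unobserved count is $n_{100} n_{001} / n_{101}$ , so adding $1$ to $n_{101}$ multiplies the estimated unobserved count by $n_{101} / (n_{101} + 1)$ . For the observed frequencies of men in subgroup $1$ and women in subgroup $12$ in Table 7 this factor would be:

$$
\frac{134.07}{134.07 + 1} \approx 0.99, \qquad \frac{0.46}{0.46 + 1} \approx 0.32 .
$$

For men in subgroup $1$ the change reported in Table 7 is in line with this: the estimated unobserved count goes from $4,279 - 1,956 = 2,323$ to $4,263 - 1,956 = 2,307$ , a factor of about $0.99$ . For women in subgroup $12$ the reported factor, $72 / 107 \approx 0.67$ , is less extreme than the simplified formula suggests, as may be expected when the selected model shares parameters across subgroups.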

Finally, we note that the Generalized Chapman estimates follow a similar trend to the ML estimates, which is relevant from a policy perspective. From this perspective a relevant difference between the ML and Generalized Chapman estimates can be observed in the years 2018 and 2020: the ML estimate for 2020 is lower than that for 2018, while the reverse is true for the Generalized Chapman estimates. This might be due to the large ML estimate in $2018$ , which may have been an overestimate due to small-sample bias. However, the estimates and conclusions presented in this section should be treated with care, because it is not clear whether in this study the size of the population and the sizes of the samples are sufficient to prevent substantial bias of the kind described by the bias expression for the Chapman-estimator in Equation (6). This is unclear because an expression such as Equation (6), which would be valuable, does not exist for this setting, and deriving one is beyond the scope of this paper. The data on homeless people in The Netherlands that were used for this section are not publicly available due to legal restrictions.

5. Discussion

This paper presents a generalization of the Chapman-estimator and compares it to other MSE-estimators known in the literature. The Generalized Chapman-estimator is first derived for saturated models and then extended such that it can deal with unsaturated models. For saturated models and a small set of unsaturated models the Generalized Chapman-estimator was derived mathematically with the help of a closed form expression of the ML-estimator. Further generalization to other unsaturated models was achieved by using the least squares estimator. The Generalized Chapman-estimator is tested in different simulation studies, which show that it outperforms other bias-reduced estimators both in terms of bias and standard error.

The derivations and simulation studies in this paper show that for any unsaturated model with three samples and some unsaturated models with four samples, or a saturated model with any number of samples, the Generalized Chapman-estimator is essentially unbiased or shows hardly any bias. We suspect that this result can be generalized toward any unsaturated model with any number of samples, although this paper does not provide a mathematical proof. We think that further research that proves, or disproves, our suspicion would be valuable.

The simulation studies also show that the bias and SE of the Generalized Chapman-estimator are less affected by overspecification than those of other MSE-estimators. This advantage is important because in practice a model is usually the result of some model selection procedure, which does not guarantee the selection of the correct model and may therefore contain redundant variables.

In Section 4 the Generalized Chapman-estimator is used to estimate the number of homeless people in The Netherlands for a series of years, and these estimates are compared with the ML estimates. For each year both estimates are based on the same log-linear model as discussed in Coumans et al. (2017). This comparison showed that the impact of bias-correction can be substantial: in our example the use of the Generalized Chapman-estimator led to an estimate that was between 9.3% and 25.4% lower for the total number of homeless people in The Netherlands, as compared to the corresponding ML-estimator. This relative difference became even larger, going up to 51%, when we zoomed in on the subgroup of women.

The simulation studies and the example in Section 4 show that the difference between the Generalized Chapman- and the standard ML-estimator can be substantial. This raises the question whether finite-sample bias correction should have a more prominent role in the discussion on the robustness of MSE methodology and the accuracy of MSE estimates, which continues to this day (see e.g., Binette and Steorts 2022; Silverman 2020). Finally, in this context it would also be valuable to have expected bias expressions for MSE estimators for more than two samples, such as the bias expression by Rivest et al. (1995) in Equation (6) for the Chapman-estimator. This would give MSE practitioners more insight into the potential accuracy of their MSE estimates.

Software

All simulation studies in this paper are performed in the statistical software program R (R Core Team 2022). All estimates are obtained with the glm() function, with family = poisson(link = "log"). Differences between the LP, ML, Chapman, Bailey, EB, and RL estimates are the sole result of different input vectors $n^{est}$ . For the IND model the estimation results for the RL-estimator were verified with the function closedp.bc() with m = "Mt" and m = "Mth" from the R-package Rcapture (Rivest 2022). Code for the simulation studies presented in this paper is available at https://github.com/DaanZult/ChapmanMSE/.
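The idea that all estimators share one Poisson log-linear fit and differ only in their input vector can be illustrated with a small stand-alone sketch. The code below is not the authors' R code and does not call glm(); it is a Python stand-in that fits a Poisson log-linear model by Newton's method, assuming indicator coding with an intercept in the first column, so that the exponent of the intercept is the estimated unobserved count, and assuming that this count is added to the unmodified observed total. All function names and the example counts are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_poisson_loglinear(design, counts, max_iter=100, tol=1e-10):
    """Maximum likelihood coefficients of a Poisson model with log link."""
    rows, cols = len(design), len(design[0])
    beta = [math.log(sum(counts) / rows)] + [0.0] * (cols - 1)
    for _ in range(max_iter):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in design]
        score = [sum(design[i][j] * (counts[i] - mu[i]) for i in range(rows))
                 for j in range(cols)]
        info = [[sum(design[i][j] * design[i][k] * mu[i] for i in range(rows))
                 for k in range(cols)] for j in range(cols)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


def population_estimate(design, counts_for_fit, n_observed):
    """Observed total plus the fitted count of the all-zero cell."""
    return n_observed + math.exp(fit_poisson_loglinear(design, counts_for_fit)[0])


# Two samples, cells 11, 10, 01; columns: intercept, in A, in B.
design = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]
print(population_estimate(design, [10, 40, 10], 60))  # LP-type input vector
print(population_estimate(design, [11, 40, 10], 60))  # Chapman-type input vector
```

Only the second argument changes between the two calls, which mirrors the statement above that the estimators differ solely through $n^{est}$ .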

Supplemental Material

Supplemental material, sj-pdf-1-jof-10.1177_0282423X251314294 for Bias Correction in Multiple Systems Estimation by Daan B. Zult, Peter G. M. van der Heijden and Bart F. M. Bakker in Journal of Official Statistics

Supplemental material, sj-pdf-2-jof-10.1177_0282423X251314294 for Bias Correction in Multiple Systems Estimation by Daan B. Zult, Peter G. M. van der Heijden and Bart F. M. Bakker in Journal of Official Statistics

Footnotes

Acknowledgements

The authors thank Jeroen Pannekoek, Peter-Paul de Wolf, Sander Scholtus, and Moniek Coumans from Statistics Netherlands and (anonymous) reviewers for their detailed comments and suggestions on this paper.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Daan B. Zult

Received: January 2024

Accepted: December 2024

References

Bailey

N. T. J.

1951. “On Estimating the Size of Mobile Populations from Recapture Data.”Biometrika 38 (3/4): 293–306. DOI: https://doi.org/10.2307/2332575.

Binette

Steorts

R. C.

2022. “On the Reliability of Multiple Systems Estimation for the Quantification of Modern Slavery.”Journal of the Royal Statistical Society: Series A (Statistics in Society) 185 (2): 640–76. DOI: https://doi.org/10.1111/rssa.12803.

Bird

S. M.

King

2018. “Multiple Systems Estimation (or Capture-Recapture Estimation) to Inform Public Policy.”Annual Review of Statistics and Its Application 5: 95–118. DOI: https://doi.org/10.1146/annurev-statistics-031017-100641.

Bishop

Y. M. M.

Fienberg

S. E.

Holland

P. W.

1975. Discrete Multivariate Analysis. New York: Springer. DOI: https://doi.org/10.1007/978-0-387-72806-3.

Chao

2001. “An Overview of Closed Capture–Recapture Models.”Journal of Agricultural, Biological, and Environmental Statistics 6: 158–75. DOI: https://doi.org/10.1198/108571101750524670.

Chapman

D. G.

1951. Some Properties of the Hypergeometric Distribution with Applications to Zoological Sample Censuses. Berkeley: University of California Press. https://babel.hathitrust.org/cgi/pt?id=wu.89045844248&view=1up&seq=3.

Chapman

D. G.

1952. “Inverse, Multiple and Sequential Sample Censuses.”Biometrics 8 (4): 286–306. DOI: https://doi.org/10.2307/3001864.

Cormack

R. M.

1989. “Log-Linear Models for Capture-Recapture.”Biometrics 45 (2): 395–413. DOI: https://doi.org/10.2307/2531485.

Coumans

M. A.

Cruyff

van der Heijden

P. G. M.

Wolf

Schmeets

2017. “Estimating Homelessness in The Netherlands Using a Capture-Recapture Approach.”Social Indicators Research 130 (1): 89–212. DOI: https://doi.org/10.1007/s11205-015-1171-7.

10.

Cramer

1922. Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press. https://archive.org/details/in.ernet.dli.2015.149716/page/n515/mode/2up.

11.

Darroch

J. N.

1958. “The Multiple-Recapture Census: I. Estimation of a Closed Population.”Biometrika 45 (3/4): 343–59. DOI: https://doi.org/10.2307/2333183.

12.

Evans

M. A.

Bonett

D. G.

1994. “Bias Reduction for Multiple-Recapture Estimators of Closed Population Size.”Biometrics 50 (2): 388–95. DOI: https://doi.org/10.2307/2533382.

13.

Fienberg

S. E.

1972. “The Multiple Recapture Census for Closed Populations and Incomplete

2^{k}

Contingency Tables.”Biometrika 59 (3): 591–603. DOI: https://doi.org/10.2307/2334810.

14.

Firth

1993. “Bias Reduction of Maximum Likelihood Estimates.”Biometrika 80 (1): 27–38. DOI: https://doi.org/10.2307/2336755.

15.

Frome

E. L.

Kutner

M. H.

Beauchamp

J. J.

1973. “Regression Analysis of Poisson-Distributed Data.”Journal of the American Statistical Association 68 (344): 935–40. DOI: https://doi.org/10.1080/01621459.1973.10481449.

16.

Hald

A. H.

1952. Statistical Theory with Engineering Applications. New York: John Wiley & Sons, Inc. https://archive.org/details/statisticaltheor0000ahal/mode/2up?view=theater.

17.

Hammond

van der Heijden

P. G. M.

Smith

P. A.

2024. “Generating Contingency Tables with Fixed Marginal Probabilities and Dependence Structures Described by Loglinear Models.”Journal of Statistical Computation and Simulation 94 (12): 2797–812. DOI: https://doi.org/10.1080/00949655.2024.2353760.

18.

International Working Group for Disease Monitoring and Forecasting. 1995. “Capture-Recapture and Multiple-Record Systems Estimation I: History and Theoretical Development.”American Journal of Epidemiology 142 (10): 1047–58. DOI: https://doi.org/10.1093/oxfordjournals.aje.a117558.

19.

Jewell

N. P.

1986. “On the Bias of Commonly Used Measures of Association for 2 x 2 Tables.”Biometrics 42 (2): 351–8. http://www.jstor.org/stable/2531055 (accessed June 17, 2024).

20.

Kosmidis

Firth

2011. “Multinomial Logit Bias Reduction via the Poisson Log-Linear Model.”Biometrika 98 (3): 755–9. https://www.jstor.org/stable/23076146.

21.

Kosmidis

Firth

2021. “Jeffreys-Prior Penalty, Finiteness and Shrinkage in Binomial-Response Generalized Linear Models.”Biometrika 108: 71–82. DOI: https://doi.org/10.1093/biomet/asaa052.

22.

Kosmidis

Kenne Pagui

E. C.

Sartori

2020. “Mean and Median Bias Reduction in Generalized Linear Models.”Statistics and Computing 30: 43–59. DOI: https://doi.org/10.1007/s11222-019-09860-6.

23.

Lincoln

F. C.

1930. Calculating Waterfowl Abundance on the Basis of Banding Returns. Vol. 118. United States Department of Agriculture. DOI: https://doi.org/10.5962/bhl.title.64010.

24.

Long

J. S.

1997. Regression Models for Categorical and Limited Dependent Variables. Vol. 7. Thousand Oaks, CA: Sage Publications, Inc. https://us.sagepub.com/en-us/nam/regression-models-for-categorical-and-limited-dependent-variables/book6071.

25.

Miller

D. M.

1984. “Reducing Transformation Bias in Curve Fitting.”The American Statistician 38 (2): 124–6. DOI: https://doi.org/10.2307/2683247.

26.

Moore

E. H.

1920. “On the Reciprocal of the General Algebraic Matrix.”Bulletin of the American Mathematical Society 26 (9): 394–5. DOI: https://doi.org/10.1090/S0002-9904-1920-03322-7.

27.

Otis

D. L.

Burnham

K. P.

White

G. C.

Anderson

D. R.

1978. “Statistical Inference from Capture Data on Closed Animal Populations.”Wildlife Monographs 62: 3–135. https://www.jstor.org/stable/3830650.

28.

Penrose. 1955. “A Generalized Inverse for Matrices.” Mathematical Proceedings of the Cambridge Philosophical Society 51 (3): 406–13. DOI: https://doi.org/10.1017/S0305004100030401.

29.

Petersen, C. G. J. 1896. “The Yearly Immigration of Young Plaice into the Limfjord from the German Sea.” Report of the Danish Biological Station 6: 5–84. https://archive.org/details/reportofdanishbi06dans/page/n1/mode/2up.

30.

R Core Team. 2022. “R: A Language and Environment for Statistical Computing.” Computer Software Manual. https://www.R-project.org/.

31.

Rainey and McCaskey. 2021. “Estimating Logit Models with Small Samples.” Political Science Research and Methods 9 (3): 549–64. DOI: https://doi.org/10.1017/psrm.2021.9.

32.

Rivest. 2022. “Rcapture: Loglinear Models for Capture-Recapture Experiments.” Computer Software Manual. https://cran.r-project.org/web/packages/Rcapture/Rcapture.pdf.

33.

Rivest and Lévesque. 2001. “Improved Log-Linear Model Estimators of Abundance in Capture-Recapture Experiments.” The Canadian Journal of Statistics 29 (4): 555–72. DOI: https://doi.org/10.2307/3316007.

34.

Rivest, Potvin, Crepeau, and Daigle. 1995. “Statistical Methods for Aerial Surveys Using the Double-Count Technique to Correct Visibility Bias.” Biometrics 51 (2): 461–70. http://www.jstor.org/stable/2532934 (accessed July 1, 2024).

35.

Sanathanan. 1972. “Estimating the Size of a Multinomial Population.” The Annals of Mathematical Statistics 43 (1): 142–52. https://www.jstor.org/stable/2239906.

36.

Silverman, B. W. 2020. “Multiple-Systems Analysis for the Quantification of Modern Slavery: Classical and Bayesian Approaches.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 183: 691–736. DOI: https://doi.org/10.1111/rssa.12505.

37.

Stephan, F. F. 1945. “The Expected Value and Variance of the Reciprocal and Other Negative Powers of a Positive Bernoullian Variate.” The Annals of Mathematical Statistics 16: 50–61. DOI: https://doi.org/10.1214/aoms/1177731170.

38.

Tilling. 2001. “Capture-Recapture Methods—Useful or Misleading?” International Journal of Epidemiology 30 (1): 12–14. DOI: https://doi.org/10.1093/ije/30.1.12.

39.

van der Heijden, P. G. M., Whittaker, Cruyff, Bakker, B. F. M., and van der Vliet. 2012. “People Born in the Middle East but Residing in The Netherlands: Invariant Population Size Estimates and the Role of Active and Passive Covariates.” The Annals of Applied Statistics 6 (3): 831–52. DOI: https://doi.org/10.1214/12-AOAS536.

40.

Wolter, K. M. 1986. “Some Coverage Error Models for Census Data.” Journal of the American Statistical Association 81: 338–46. DOI: https://doi.org/10.2307/2289222.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
