Abstract

Probability surveys currently face important drawbacks: costs are relatively high and participation rates are decreasing, which could yield less accurate estimates. Meanwhile, nonprobability samples such as administrative records are growing in popularity because of their convenience and low cost. Unfortunately, nonprobability samples are often selective and, as the underlying sampling design is unknown, estimators based on such samples are generally biased. Research is ongoing on how to deal with this selection bias. In this paper, a method is proposed that combines estimators from a probability and a nonprobability sample at an aggregated level. Our estimator is constructed as a weighted mean of both estimators. The weight is chosen to minimize the expected value of the mean squared error (MSE) of the combined estimator under an assumed model for the bias in the estimator based on the nonprobability sample. Our method does not require any data at the level of the individual units in the samples. We performed simulation studies in which two different methods of modeling the bias in the nonprobability sample were tested. We also applied one of these methods to a real dataset from Statistics Netherlands and showed that the MSE was indeed reduced in a real application.

Keywords

administrative data, data integration, probability samples, nonprobability samples

1. Introduction

The sample survey, and with it the probability sampling framework, emerged as a field in the first half of the twentieth century to satisfy the need for keeping a general record of the nationwide population. This need has since become more complex and more fine-grained, to the point of requiring ever more precise statistics for different subpopulations, which we also call domains. These needs are voiced by policymakers, for whom resource allocation, social program implementation, and environmental planning are important, and by the private market, especially small businesses that rely on such estimates (Pfeffermann 2002; Rao 2003).

Implementing a probability survey is demanding, but it carries several benefits, the most important being that it allows one to make unbiased inferences about a target population (Lohr 2019). Nevertheless, probability sampling has encountered many drawbacks in the last few years. Due to its high cost, increasing non-response, and debate about the coverage error that this implies, the validity of probability sampling has started to be questioned (Brick 2014).

Alongside that, in recent years there has been an increasing interest in the statistical use of data obtained from sources outside the probability sampling framework: the so-called nonprobability samples. National Statistical Institutes (NSIs) are particularly interested in nonprobability samples since they are constantly motivated to improve the quality of estimators and reduce data collection costs (Van den Brakel 2019). At NSIs such as Statistics Netherlands (CBS), examples of nonprobability data include many administrative datasets, such as data from the Tax Office, from energy providers, and from education registers.

A nonprobability sample can be counterproductive for various reasons: there is no clear sampling framework to work with, and such data can easily lead to a large bias in estimators because of selectivity. Selectivity is a major issue when using nonprobability samples. It refers to the situation in which the part of the target population included in the sample is substantially different from the part that is not included, making it difficult to make inferences about the target population. Nonprobability samples are likely to be selective with respect to the target population, and the selection probability of units usually remains unknown (M. R. Elliott and Valliant 2017). Nonetheless, this type of sample can be considered inexpensive in comparison with a probability sample, and methods that try to account for selectivity have been developed. Moreover, the use of nonprobability samples could eventually help reduce the variance of estimators for parameters of interest. In the case of CBS, efforts have been made to go from obtaining estimates from surveys to obtaining them from administrative records, which could be a valuable source of information to overcome the limitations of survey data (Linder et al. 2014).

The increased interest in nonprobability samples is reflected in the large number of articles published in recent years on their use for estimation purposes. For example, Baker et al. (2013), M. R. Elliott and Valliant (2017), Cornesse et al. (2020), Valliant (2020), and Rao (2021) provide overviews of methods for dealing with selective nonprobability samples. These papers show that results are not very promising for approaches that rely on nonprobability samples only.

In recent literature various approaches to integrate both probability and nonprobability samples have been proposed. Popular approaches are sample matching (see, e.g., Baker et al. 2013; Rivers 2007; Rivers and Bailey 2009), calibration approaches (see, e.g., Lee 2006; Lee and Valliant 2009), Bayesian approaches (see, e.g., Sakshaug et al. 2019; Wiśniowski et al. 2020), pseudo-weight approaches (see, e.g., M. R. Elliott 2009; M. R. Elliott and Valliant 2017; A.-C. Liu et al. 2023; Valliant and Dever 2011) and doubly robust approaches (see, e.g., Chen et al. 2020; Z. Liu et al. 2023; Yang et al. 2020). Other relevant articles on the use of nonprobability samples for estimation purposes are Meng (2018), who focused on assessing selection error and selection bias in estimators based on nonprobability samples, and Kim and Tam (2021), who examined the situation where a target variable is observed in a probability sample and, with measurement error, also in a nonprobability sample.

In general, the combination of samples has shown more encouraging results than using a nonprobability sample only and that is why the method we investigate in this article combines data from a large nonprobability sample and a small probability sample. Our method can be seen as an extension of an approach proposed by M. N. Elliott and Haviland (2007). Whereas they largely assumed that the bias in the estimator based on the nonprobability sample is known, we relax this assumption by using ideas from Pannekoek and De Waal (1998) to set up a relatively simple model for the bias of this estimator. Based on this model we construct a combined estimator that is a weighted mean of the estimator for the probability sample and the estimator for the nonprobability sample. We restrict attention to categorical data and focus on estimating proportions.

In contrast to most papers on combining probability samples with nonprobability samples, for our proposed method it is not necessary to link the two samples at the level of individual units. It is not even necessary to have any data at the level of individual units: only aggregated data are required. So, our proposed method can be applied in more cases than methods that require data at unit level. Obviously, when data are available at unit level, methods designed for that situation are very likely to give better results than our proposed method, simply because they utilize more information. In the worst case where none of the methods designed for the situation where unit level data for both samples are available gives acceptable results, one can always aggregate those data and apply our proposed method.

The rest of the article is organized as follows. In Section 2, we specify the proposed methodology. In Section 3, we describe the simulation conditions for the assessment of the method and assess its performance. In Section 4 we apply our method to real data from CBS. In Section 5 we summarize our main conclusions and propose suggestions for future research.

2. Methodology

We suppose there is a target population of $N$ units ( $i = 1, \dots, N$ ) with a categorical target variable $y$ with $C \geq 2$ categories (that is, $y_{i} \in \{1, \dots, C\}$ for each unit $i$ ) and that observations on $y$ are available from a probability and a nonprobability sample, regardless of whether the same unit is present in one or both samples. Furthermore, we suppose units in the target population are divided into $K$ domains, where domain $k$ contains $N_{k}$ units. We aim to estimate a proportion $Z_{kc}$ for each category $c$ per domain $k$ in the target population as illustrated in Table 1, where each row represents a domain $k$ ( $k = 1, \dots, K$ ), each column represents a category $c$ ( $c = 1, \dots, C$ ) of $y$ , and each cell $(Z_{kc})$ represents the proportion of units within domain $k$ that belong to category $c$ .

Table 1.

Categories per Domain.

Domains	Categories				Total
	$c = 1$	$c = 2$	⋯	$c = C$	
$k = 1$	$Z_{11}$	$Z_{12}$	⋯	$Z_{1 C}$	1
$k = 2$	$Z_{21}$	$Z_{22}$	⋯	$Z_{2 C}$	1
⋮	⋮	⋮	⋱	⋮	⋮
$k = K$	$Z_{K 1}$	$Z_{K 2}$	⋯	$Z_{KC}$	1

Before we discuss the situation where one of the samples is a probability sample and the other a nonprobability one, we first consider the situation with two independent probability samples $S^{(P_{1})}$ and $S^{(P_{2})}$ , both obtained by simple random sampling (SRS). We then would have two unbiased estimators ${\hat{Z}}_{kc}^{(P_{1})}$ and ${\hat{Z}}_{kc}^{(P_{2})}$ given by the proportion of category $c$ in domain $k$ in $S^{(P_{1})}$ , respectively $S^{(P_{2})}$ . For many sampling designs, and certainly for SRS, we can estimate the sampling variance of ${\hat{Z}}_{kc}^{(P_{1})}$ and ${\hat{Z}}_{kc}^{(P_{2})}$ , and therefore we can estimate the Mean Squared Errors (MSEs) of ${\hat{Z}}_{kc}^{(P_{1})}$ and ${\hat{Z}}_{kc}^{(P_{2})}$ (for unbiased estimators, MSEs would be equal to the corresponding variances).

Next, we could construct a combined estimator of the form

{\hat{D}}_{kc} = W_{kc} {\hat{Z}}_{kc}^{(P_{1})} + (1 - W_{kc}) {\hat{Z}}_{kc}^{(P_{2})} .

(1)

where $W_{kc}$ is a weight between zero and one. We could find optimal weights $W_{kc}$ by minimizing the MSE of ${\hat{D}}_{kc}$ . These optimal weights are given by (see Särndal et al. 1992, Subsection 9.9.1):

W_{kc} = \frac{MSE ({\hat{Z}}_{kc}^{(P_{2})})}{MSE ({\hat{Z}}_{kc}^{(P_{1})}) + MSE ({\hat{Z}}_{kc}^{(P_{2})})} .

(2)

When $MSE ({\hat{Z}}_{kc}^{(P_{1})})$ is relatively large $W_{kc}$ will be close to 0, and when $MSE ({\hat{Z}}_{kc}^{(P_{1})})$ is relatively small $W_{kc}$ will be close to 1. The estimator ${\hat{D}}_{kc}$ will have an MSE that is less than or equal to the lowest MSE of the separate estimators (see, e.g., Pfeffermann 2013).
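
Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes two independent, unbiased SRS estimators of the same proportion; the function names and the numerical values (a true proportion of 0.3 and sample sizes of 200 and 800) are hypothetical and serve only as an example.

```python
def optimal_weight(mse_1: float, mse_2: float) -> float:
    """Weight on estimator 1 as in Equation (2)."""
    return mse_2 / (mse_1 + mse_2)


def combined_mse(w: float, mse_1: float, mse_2: float) -> float:
    """MSE of w * est_1 + (1 - w) * est_2, assuming the two estimators
    are independent and unbiased (so each MSE equals a variance)."""
    return w ** 2 * mse_1 + (1 - w) ** 2 * mse_2


# Hypothetical values: true proportion and two SRS sample sizes.
z, n_1, n_2 = 0.3, 200, 800
mse_1 = z * (1 - z) / n_1  # variance of the proportion in sample 1
mse_2 = z * (1 - z) / n_2  # variance of the proportion in sample 2

w = optimal_weight(mse_1, mse_2)
mse_d = combined_mse(w, mse_1, mse_2)
print(w, mse_d, min(mse_1, mse_2))
```

In this example the less precise estimator (from the smaller sample) receives the smaller weight, and the MSE of the combination does not exceed the smaller of the two separate MSEs.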

In our situation, we have a (small) probability sample $S^{(P)}$ and a (large) nonprobability sample $S^{(NP)}$ . We estimate a proportion from the probability sample ( ${\hat{Z}}_{kc}^{(P)}$ ) and from the nonprobability one ( ${\hat{Z}}_{kc}^{(NP)}$ ), in both cases simply by directly calculating the (unweighted) proportion of category $c$ in domain $k$ in these samples. We assume that the probability sample $S^{(P)}$ is a simple random sample stratified by domain and that $n_{k}^{(P)} \ll N_{k}$ , where $n_{k}^{(P)}$ stands for the sample size of domain $k$ in the probability sample. We are especially interested in situations where $n_{k}^{(NP)} \gg n_{k}^{(P)}$ , where $n_{k}^{(NP)}$ stands for the sample size of domain $k$ in the nonprobability sample $S^{(NP)}$ . Estimator ${\hat{Z}}_{kc}^{(P)}$ is generally not biased but will have a large variance, whereas estimator ${\hat{Z}}_{kc}^{(NP)}$ is likely to be biased but generally has a smaller variance because it is based on a large number of units.

Again, we define a combined estimator ${\hat{D}}_{kc}$ of the form

{\hat{D}}_{kc} = W_{kc} {\hat{Z}}_{kc}^{(P)} + (1 - W_{kc}) {\hat{Z}}_{kc}^{(NP)} .

(3)

The challenge that we now have in constructing the weight $W_{kc}$ is that the bias of ${\hat{Z}}_{kc}^{(NP)}$ is unknown and therefore $MSE ({\hat{Z}}_{kc}^{(NP)})$ cannot be estimated directly. To overcome this problem, we assume a model for the bias of ${\hat{Z}}_{kc}^{(NP)}$ and use this to compute expected MSEs (EMSEs).

We define the EMSE of an estimator ${\hat{Z}}_{kc}^{(*)}$ , where superscript * either refers to the probability sample $S^{(P)}$ or to the nonprobability sample $S^{(NP)}$ , for the proportion of units with category $c$ in domain $k$ by

EMSE ({\hat{Z}}_{kc}^{(*)}) \equiv E_{b} E_{d} [{({\hat{Z}}_{kc}^{(*)} - Z_{kc})}^{2}],

where subscript $d$ indicates repeated sampling under the known sampling design for the probability sample and the unknown sampling “design” for the nonprobability sample, and subscript $b$ the posited model for the bias of ${\hat{Z}}_{kc}^{(NP)}$ .

We can then base the weight in our combined estimator on these EMSEs, analogously to Equation (2):

W_{kc} = \frac{EMSE ({\hat{Z}}_{kc}^{(NP)})}{EMSE ({\hat{Z}}_{kc}^{(NP)}) + EMSE ({\hat{Z}}_{kc}^{(P)})} .

(4)

Note that here it is assumed that ${\hat{Z}}_{kc}^{(NP)}$ and ${\hat{Z}}_{kc}^{(P)}$ are independent estimators, which is true if the two samples were drawn independently of each other. A similar idea for a combined estimator was proposed in a different setting by Pannekoek and De Waal (1998). Note that the EMSE of an estimator is calculated under an assumed model. This means that the same estimator will generally have a different EMSE for different models.

We will now derive a weight of the form given in Equation (4) under a model for the bias

b_{kc} = E_{d} ({\hat{Z}}_{kc}^{(NP)}) - Z_{kc}

(5)

of ${\hat{Z}}_{kc}^{(NP)}$ . As a starting point we note that, since every unit in domain $k$ has to belong to some category $c$ , the overestimation of one category will be compensated for by the underestimation of one or more other categories. More precisely, we have

\sum_{c = 1}^{C} b_{kc} = 0, (k = 1, \dots, K) .

(6)

For our basic model, we assume the expected value of the bias is constant across domains but potentially different for different categories. We therefore assume that $b_{kc}$ is distributed as a random variable with mean and variance

E_{b} (b_{kc}) = β_{c}

E_{b} [{(b_{kc} - β_{c})}^{2}] = σ^{2}

where $\sum_{c = 1}^{C} β_{c} = 0$ because of Equation (6). Our model is suitable for situations where the domains do not explain the selectivity in the nonprobability sample. Note that our model is closely related to models used in small area estimation (see, e.g., Subsection 5.3 in Rao 2003). An important difference is that in small area estimation the data are usually from a single probability sample, whereas in our case the data are from a probability sample and a nonprobability sample.

Under our model for the bias, we can derive the following expressions for the model-based EMSE of ${\hat{Z}}_{kc}^{(NP)}$ and ${\hat{Z}}_{kc}^{(P)}$ (see Appendix A):

EMSE ({\hat{Z}}_{kc}^{(NP)}) = β_{c}^{2} + σ^{2} + \frac{v_{kc}}{n_{k}^{(NP)} - 1};

(7)

EMSE ({\hat{Z}}_{kc}^{(P)}) = \frac{E_{b} [Z_{kc} (1 - Z_{kc})]}{n_{k}^{(P)}} = \frac{1}{n_{k}^{(P)}} (\frac{n_{k}^{(NP)}}{n_{k}^{(NP)} - 1} v_{kc} + β_{c} [2 E_{b} ({\tilde{Z}}_{kc}) - 1] - β_{c}^{2} - σ^{2}),

(8)

where ${\tilde{Z}}_{kc} = E_{d} ({\hat{Z}}_{kc}^{(NP)})$ and $v_{kc} = E_{b} E_{d} [{\hat{Z}}_{kc}^{(NP)} (1 - {\hat{Z}}_{kc}^{(NP)})]$ . For large nonprobability samples, the factor $n_{k}^{(NP)} / (n_{k}^{(NP)} - 1)$ in Equation (8) could be ignored.

Equation (8) may seem surprising, since an obvious estimator for $E_{b} [Z_{kc} (1 - Z_{kc})] / n_{k}^{(P)}$ is ${\hat{Z}}_{kc}^{(P)} (1 - {\hat{Z}}_{kc}^{(P)}) / n_{k}^{(P)}$ . However, if ${\hat{Z}}_{kc}^{(P)} (1 - {\hat{Z}}_{kc}^{(P)})$ were an accurate estimator for $Z_{kc} (1 - Z_{kc})$ , ${\hat{Z}}_{kc}^{(P)}$ would certainly be an accurate estimator for $Z_{kc}$ and there would be no need to combine ${\hat{Z}}_{kc}^{(P)}$ with an estimator based on the nonprobability sample. Our model for $b_{kc}$ allows us to choose on which sample, the probability sample or the nonprobability sample, we base our estimates for $EMSE ({\hat{Z}}_{kc}^{(P)})$ and $EMSE ({\hat{Z}}_{kc}^{(NP)})$ . We have opted to base those estimates on the nonprobability sample in combination with our model for $b_{kc}$ . Besides aiming to improve the accuracy of the estimate for $E_{b} [Z_{kc} (1 - Z_{kc})] / n_{k}^{(P)}$ , we also did this to keep the estimates for $EMSE ({\hat{Z}}_{kc}^{(P)})$ and $EMSE ({\hat{Z}}_{kc}^{(NP)})$ as comparable to each other as possible.

Substituting the above expressions Equations (7) and (8) into Equation (4) we obtain:

W_{kc} = \frac{(n_{k}^{(NP)} - 1) (β_{c}^{2} + σ^{2}) + v_{kc}}{(n_{k}^{(NP)} - 1) (1 - \frac{1}{n_{k}^{(P)}}) (β_{c}^{2} + σ^{2}) + (1 + \frac{n_{k}^{(NP)}}{n_{k}^{(P)}}) v_{kc} + \frac{n_{k}^{(NP)} - 1}{n_{k}^{(P)}} β_{c} [2 E_{b} ({\tilde{Z}}_{kc}) - 1]} .

(9)
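
As a check on the algebra, the following Python sketch evaluates Equations (7) and (8), forms the weight as the ratio in Equation (4), and compares it with the closed form in Equation (9). The function names and the parameter values are hypothetical; in practice the parameters would be replaced by estimates.

```python
def emse_np(beta, sigma2, v, n_np):
    """EMSE of the nonprobability-sample estimator, Equation (7)."""
    return beta ** 2 + sigma2 + v / (n_np - 1)


def emse_p(beta, sigma2, v, z_tilde, n_p, n_np):
    """EMSE of the probability-sample estimator, Equation (8)."""
    inner = (n_np / (n_np - 1) * v
             + beta * (2 * z_tilde - 1)
             - beta ** 2 - sigma2)
    return inner / n_p


def weight_ratio(beta, sigma2, v, z_tilde, n_p, n_np):
    """Weight on the probability-sample estimator, Equation (4)."""
    e_np = emse_np(beta, sigma2, v, n_np)
    e_p = emse_p(beta, sigma2, v, z_tilde, n_p, n_np)
    return e_np / (e_np + e_p)


def weight_closed_form(beta, sigma2, v, z_tilde, n_p, n_np):
    """The same weight written out as in Equation (9)."""
    b2s2 = beta ** 2 + sigma2
    num = (n_np - 1) * b2s2 + v
    den = ((n_np - 1) * (1 - 1 / n_p) * b2s2
           + (1 + n_np / n_p) * v
           + (n_np - 1) / n_p * beta * (2 * z_tilde - 1))
    return num / den


# Hypothetical parameter values for one domain-category cell.
params = dict(beta=0.05, sigma2=0.0004, v=0.2275, z_tilde=0.35,
              n_p=100, n_np=2000)
w_ratio = weight_ratio(**params)
w_closed = weight_closed_form(**params)
print(w_ratio, w_closed)
```

Both routes give the same weight up to floating-point error, because Equation (9) is Equation (4) with numerator and denominator multiplied by $n_{k}^{(NP)} - 1$ .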

To apply this in practice we need to estimate $β_{c}$ , $σ^{2}$ , $E_{b} ({\tilde{Z}}_{kc})$ , and $v_{kc}$ from the two given samples. First, to estimate $β_{c}$ we can use

{\hat{β}}_{c} = \frac{1}{K} \sum_{k = 1}^{K} ({\hat{Z}}_{kc}^{(NP)} - {\hat{Z}}_{kc}^{(P)}),

(10)

that is, the average difference between the nonprobability and the probability sample estimates for category $c$ across all domains. This is an unbiased estimator of $β_{c}$ , since it holds under the assumed model that

\begin{matrix} E_{b} E_{d} ({\hat{β}}_{c}) = \frac{1}{K} \sum_{k = 1}^{K} E_{b} [E_{d} ({\hat{Z}}_{kc}^{(NP)}) - E_{d} ({\hat{Z}}_{kc}^{(P)})] = \\ \frac{1}{K} \sum_{k = 1}^{K} E_{b} ({\tilde{Z}}_{kc} - Z_{kc}) = \frac{1}{K} \sum_{k = 1}^{K} E_{b} (b_{kc}) = β_{c} . \end{matrix}

Provided that the number of domains $K > 1$ , asymptotically as $n_{k}^{(P)}, n_{k}^{(NP)} \to \infty$ a consistent estimator of $σ^{2}$ is given by:

{\hat{σ}}^{2} = \frac{1}{(K - 1) C} \sum_{k = 1}^{K} \sum_{c = 1}^{C} {({\hat{Z}}_{kc}^{(NP)} - {\hat{Z}}_{kc}^{(P)} - {\hat{β}}_{c})}^{2} .

(11)

For finite samples this estimator is likely to be biased upwards in practice, since it is also affected by random sampling errors in ${\hat{Z}}_{kc}^{(P)}$ and ${\hat{Z}}_{kc}^{(NP)}$ ; see Appendix A for details. Consequently, in small samples we may somewhat overestimate $EMSE ({\hat{Z}}_{kc}^{(NP)})$ and somewhat underestimate $EMSE ({\hat{Z}}_{kc}^{(P)})$ ; cf. Equations (7) and (8). This is not necessarily a drawback, as it could provide some additional robustness against bias in the combined estimator, at the cost of a larger variance.

By definition unbiased estimators of $E_{b} ({\tilde{Z}}_{kc}) = E_{b} E_{d} ({\hat{Z}}_{kc}^{(NP)})$ and $v_{kc} = E_{b} E_{d} [{\hat{Z}}_{kc}^{(NP)} (1 - {\hat{Z}}_{kc}^{(NP)})]$ are given by ${\hat{Z}}_{kc}^{(NP)}$ itself and

{\hat{v}}_{kc} = {\hat{Z}}_{kc}^{(NP)} (1 - {\hat{Z}}_{kc}^{(NP)}) .

(12)

Finally, after calculating $W_{kc}$ with the above estimates, we compute the combined estimator ${\hat{D}}_{kc}$ given by Equation (3).

It should be noted that the estimated values ${\hat{D}}_{kc}$ do not necessarily sum to one for each domain $k$ . This can be corrected by using, for instance, prorating (Pannekoek 2014): ${\hat{D}}_{kc}^{*} = {\hat{D}}_{kc} / \sum_{c^{'}} {\hat{D}}_{k c^{'}}$ . In our simulation study and real application, the sum of ${\hat{D}}_{kc}$ over the categories was always close to one for each domain $k$ ; nevertheless, we adjusted the combined estimator using prorating per domain.
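
The estimation steps of this section can be sketched in Python as follows. This is a minimal illustration that works on aggregated proportions only; the function name `combine`, the array layout, and the example numbers are hypothetical, and the sketch contains no safeguards for degenerate inputs (for instance, an estimated $EMSE ({\hat{Z}}_{kc}^{(P)})$ that is not positive).

```python
import numpy as np


def combine(z_p, z_np, n_p, n_np):
    """Combined estimator from aggregated data.

    z_p, z_np : (K, C) arrays of estimated proportions per domain and
                category (probability / nonprobability sample).
    n_p, n_np : length-K arrays of sample sizes per domain.
    """
    z_p = np.asarray(z_p, dtype=float)
    z_np = np.asarray(z_np, dtype=float)
    n_p = np.asarray(n_p, dtype=float)[:, None]    # shape (K, 1)
    n_np = np.asarray(n_np, dtype=float)[:, None]  # shape (K, 1)
    K, C = z_p.shape

    diff = z_np - z_p
    beta = diff.mean(axis=0)                               # Eq. (10)
    sigma2 = ((diff - beta) ** 2).sum() / ((K - 1) * C)    # Eq. (11)
    v = z_np * (1 - z_np)                                  # Eq. (12)

    emse_np = beta ** 2 + sigma2 + v / (n_np - 1)          # Eq. (7)
    emse_p = (n_np / (n_np - 1) * v + beta * (2 * z_np - 1)
              - beta ** 2 - sigma2) / n_p                  # Eq. (8)

    w = emse_np / (emse_np + emse_p)                       # Eq. (4)
    d = w * z_p + (1 - w) * z_np                           # Eq. (3)
    d_star = d / d.sum(axis=1, keepdims=True)              # prorating
    return d_star, w, beta, sigma2


# Hypothetical aggregated input: K = 3 domains, C = 3 categories.
z_p = [[0.30, 0.50, 0.20], [0.25, 0.45, 0.30], [0.35, 0.40, 0.25]]
z_np = [[0.36, 0.48, 0.16], [0.33, 0.42, 0.25], [0.40, 0.41, 0.19]]
d_star, w, beta, sigma2 = combine(z_p, z_np, [100] * 3, [2000] * 3)
print(np.round(d_star, 4))
```

Because both input tables have rows that sum to one, the estimated $β_{c}$ sum to zero over the categories, in line with Equation (6).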

Our approach can be extended by using auxiliary data in various ways. In our simulation study, we focus on a simple extension. For that extension, we assume that an auxiliary variable that partly explains the selectivity of the nonprobability sample is available in the probability sample. An example of such a situation is when the target variable is Educational level (observed in a probability sample and in a nonprobability administrative dataset), the auxiliary variable is Age class (observed in the probability sample), the number of people is known for each combination of Age class and domain, and the inclusion in the administrative dataset on Educational level depends on Age class. In such a situation, instead of using Equation (10) to estimate $β_{c}$ one can use

{\hat{β}}_{c} = \frac{1}{K} \sum_{k = 1}^{K} \sum_{a = 1}^{A} \frac{N_{k . a}}{N_{k}} ({\hat{Z}}_{kc}^{(NP)} - {\hat{Z}}_{kc, a}^{(P)}),

(13)

where $A$ denotes the number of categories of the auxiliary variable, ${\hat{Z}}_{kc, a}^{(P)}$ is the estimated proportion of units in category $c$ of the target variable within domain $k$ and category $a$ of the auxiliary variable, and $N_{k . a}$ is the number of units in category $a$ of the auxiliary variable and domain $k$ in the population. Note that, if the selectivity of the nonprobability sample is explained by the auxiliary variable, $\sum_{a = 1}^{A} \frac{N_{k . a}}{N_{k}} ({\hat{Z}}_{kc, a}^{(NP)} - {\hat{Z}}_{kc}^{(NP)}) \approx 0$ should hold for all $k$ , where ${\hat{Z}}_{kc, a}^{(NP)}$ is defined analogously to ${\hat{Z}}_{kc, a}^{(P)}$ .
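
A minimal Python sketch of Equation (13) is given below. The function name, the array layout, and the example numbers are hypothetical, and the sketch assumes that the categories of the auxiliary variable partition each domain, so that $N_{k} = \sum_{a} N_{k . a}$ .

```python
import numpy as np


def beta_hat_aux(z_np, z_p_a, pop_counts):
    """Estimate beta_c as in Equation (13).

    z_np       : (K, C) proportions from the nonprobability sample.
    z_p_a      : (K, A, C) proportions from the probability sample within
                 domain k and auxiliary category a.
    pop_counts : (K, A) population counts N_{k.a}.
    """
    z_np = np.asarray(z_np, dtype=float)
    z_p_a = np.asarray(z_p_a, dtype=float)
    pop_counts = np.asarray(pop_counts, dtype=float)

    # Population shares N_{k.a} / N_k, assuming N_k is the row total.
    shares = pop_counts / pop_counts.sum(axis=1, keepdims=True)
    diff = z_np[:, None, :] - z_p_a              # shape (K, A, C)
    per_domain = (shares[:, :, None] * diff).sum(axis=1)
    return per_domain.mean(axis=0)               # average over domains


# Hypothetical input: K = 2 domains, A = 2 age classes, C = 2 categories.
z_np = [[0.6, 0.4], [0.5, 0.5]]
z_p_a = [[[0.5, 0.5], [0.7, 0.3]],
         [[0.4, 0.6], [0.6, 0.4]]]
pop_counts = [[300, 700], [500, 500]]
beta_aux = beta_hat_aux(z_np, z_p_a, pop_counts)
print(beta_aux)
```

Since the population shares sum to one within each domain, this amounts to comparing the nonprobability proportion with a poststratified proportion from the probability sample.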

We also examined an even simpler version of our basic model for bias $b_{kc}$ , namely a model where $E_{b} (b_{kc}) = 0$ for all $k = 1, \dots, K$ and $c = 1, \dots, C$ . In that simpler model, which is a special case of the model used in this article, fewer model parameters need to be estimated, which in some cases may lead to slightly more accurate estimates. However, we found that our more complicated model outperforms the simpler model. In this article, we therefore focus on our more complicated model only.

3. Model Assessment with Simulated Data

3.1. Simulated Conditions

We simulated a population and repeatedly drew a probability sample and a nonprobability one. Then, we estimated the EMSE for ${\hat{Z}}_{kc}^{(NP)}$ and ${\hat{Z}}_{kc}^{(P)}$ to obtain the weight that the estimator for each sample should have, and we computed the combined estimator ${\hat{D}}_{kc}$ .

We generated populations consisting of $N = 100{,}000$ units. For each unit we generated an outcome variable that defines to which category $c$ of the target variable and which domain $k$ it belongs. We also generated two auxiliary variables, one of which was used to manipulate the level of selectivity in the nonprobability sample. We considered $K \in \{2, 4, 10, 15\}$ domains, $C \in \{3, 5, 8, 15\}$ categories, a scenario where all categories are of equal size, and a scenario where categories are of unequal sizes. For each of the domains, the category sizes in the unequal-size scenarios with 3, 5, and 8 categories were based on the real distribution of CBS data for the variable educational attainment for the Dutch population. We also simulated sample sizes per domain of $n_{k}^{(P)} \in \{20, 100, 400, 900\}$ and $n_{k}^{(NP)} \in \{100, 1000, 2000, 6000\}$ , and we introduced three different levels of selectivity.

To generate the data, the units were first assigned randomly to $K$ domains of approximately equal size. For each unit $i$ , three continuous variables $x_{1}$ , $x_{2}$ , and $y$ were drawn from the following model:

x_{1 i} = (1 - f) ϵ_{1 i} + f u_{k (i)},

x_{2 i} = ϵ_{2 i},

y_{i} = \sqrt{1 / 3} x_{1 i} + \sqrt{1 / 3} x_{2 i} + \sqrt{1 / 3} ϵ_{3 i},

where $ϵ_{1 i}$ , $ϵ_{2 i}$ , $ϵ_{3 i}$ , and $u_{k}$ are independent draws from a standard normal distribution and $k (i)$ indicates the domain $k$ to which unit $i$ belongs. The coefficient $f$ was chosen equal to $\sqrt{1 / 10}$ . Hence, the total variance of $y$ was set to be 1, of which two-thirds are explained by the independent variables $x_{1}$ and $x_{2}$ . Furthermore, the variable $x_{1}$ contains a domain-specific effect such that 10% of its total variance is explained by the domain to which a unit belongs.

The continuous variable $y$ was then transformed into a categorical outcome variable with a different number of categories according to the above-mentioned simulation scenarios, taking the category size into account. The independent variable $x_{1}$ was also transformed into a categorical variable $a$ with three categories of sizes 35%, 40%, and 25%. This categorical auxiliary variable was used as the “Age class” variable in the extended model discussed at the end of Section 2. Note that the domain-specific effect in $x_{1}$ was included to obtain some variation in the distribution of $a$ across domains.

From the target population, probability samples were drawn with equal inclusion probabilities using SRS, whereas nonprobability samples were drawn with unequal probabilities using randomized systematic sampling, following the approach in Smit (2021), where the probability of inclusion is proportional to $λ_{i}$ given by

λ_{i} = \frac{1}{1 + \exp \{- g (a_{i}) (y_{i} - 0.75)\}} .

(14)

The coefficient $g$ depends on the auxiliary variable $a$ :

g (a_{i}) = \begin{cases} 2 & \text{if } a_{i} = 1 \\ 2 - δ & \text{if } a_{i} = 2 \\ 2 - 2 δ & \text{if } a_{i} = 3 \end{cases}

Three different scenarios were considered for the step size $δ$ : $δ \in \{0.5, 1.0, 1.5\}$ , leading to increasing differences between inclusion probabilities for different categories of the auxiliary variable. In addition, since $λ_{i}$ depends directly on the continuous variable $y$ on which the categorical target variable was based, the selectivity in the nonprobability sample was non-ignorable for estimating domain proportions $Z_{kc}$ .

To illustrate the different scenarios, Figure 1 shows the shape of $λ_{i}$ from Equation (14) as a function of $y_{i}$ , for a given value of $a_{i} \in \{1, 2, 3\}$ and a given choice of the step size $δ \in \{0.5, 1.0, 1.5\}$ . Note that for $a_{i} = 1$ the function does not depend on $δ$ . For $δ = 0.5$ , $λ_{i}$ is a monotone increasing function of $y_{i}$ in all three strata based on $a_{i}$ ; hence the contributions of the three strata to the bias in ${\hat{Z}}_{kc}^{(NP)}$ are expected to be concordant in this scenario. For $δ = 1.0$ , $λ_{i}$ is monotone increasing in the first two strata but constant in the third stratum. Finally for $δ = 1.5$ , $λ_{i}$ is monotone increasing in the first two strata (albeit with a much decreased slope for $a_{i} = 2$ ), while it is monotone decreasing for $a_{i} = 3$ . Hence, in this final scenario the contributions of the three strata to the bias in ${\hat{Z}}_{kc}^{(NP)}$ are expected to cancel out to some extent. For context, the dashed vertical lines in the figure show the 2.5% and 97.5% quantiles of the distribution of $y_{i}$ in each stratum.
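
The behavior of $λ_{i}$ described above can be reproduced with a short Python sketch of Equation (14). The function names and the grid of $y$ values are hypothetical; the sketch only evaluates the selection propensity and does not implement the randomized systematic sampling itself.

```python
import numpy as np


def g(a, delta):
    """Coefficient g(a_i): 2, 2 - delta, 2 - 2 * delta for a_i = 1, 2, 3."""
    return 2.0 - (np.asarray(a) - 1) * delta


def selection_propensity(y, a, delta):
    """lambda_i from Equation (14); inclusion is proportional to this value."""
    y = np.asarray(y, dtype=float)
    return 1.0 / (1.0 + np.exp(-g(a, delta) * (y - 0.75)))


# Hypothetical grid of y values to inspect the shape in stratum a = 3.
y_grid = np.linspace(-2.0, 2.0, 9)
for delta in (0.5, 1.0, 1.5):
    lam = selection_propensity(y_grid, 3, delta)
    print(delta, np.round(lam, 3))
```

For $a_{i} = 3$ the printed values increase with $y_{i}$ when $δ = 0.5$ , are constant at 0.5 when $δ = 1.0$ , and decrease when $δ = 1.5$ .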

Figure 1.

Illustration of $λ_{i}$ as a function of $y_{i}$ , $a_{i}$ , and $δ$ .

The overall effect of using inclusion probabilities based on Equation (14) was that $δ = 0.5$ yielded the largest absolute bias values in ${\hat{Z}}_{kc}^{(NP)}$ and $δ = 1.5$ the smallest, with $δ = 1.0$ somewhere in between.

Combining all different simulation conditions mentioned above for $K$ , $C$ , $n_{k}^{(P)}$ , $n_{k}^{(NP)}$ , $δ$ , and equal or unequal category sizes, we considered a total of 4 × 4 × 4 × 4 × 3 × 2 = 1,536 scenarios and drew 1,000 simulations for each of them.
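
The full factorial design can be enumerated as in the following Python sketch; the variable names are hypothetical and only the factor levels are taken from the text.

```python
from itertools import product

# Factor levels as listed in Subsection 3.1.
K_values = (2, 4, 10, 15)
C_values = (3, 5, 8, 15)
n_p_values = (20, 100, 400, 900)
n_np_values = (100, 1000, 2000, 6000)
delta_values = (0.5, 1.0, 1.5)
size_values = ("equal", "unequal")

# One tuple (K, C, n_p, n_np, delta, category sizes) per scenario.
scenarios = list(product(K_values, C_values, n_p_values,
                         n_np_values, delta_values, size_values))
print(len(scenarios))
```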

3.2. Evaluation

We evaluated the root mean squared error (RMSE) and the bias. First, the RMSE per domain and category was computed as follows, where $R = 1000$ is the number of simulations and $r$ denotes a specific simulation round ( $r = 1, \dots, R$ ):

RMSE_{kc} = \sqrt{\frac{1}{R} \sum_{r = 1}^{R} {(Z_{kc} - {\hat{D}}_{kc, r})}^{2}} .

(15)

Here ${\hat{D}}_{kc, r}$ may also denote a direct estimator based on one of the samples only instead of the combined estimator.

We combined the $RMSE_{kc}$ into one average performance measure per domain $ARMSE_{k}$ defined by

ARMSE_{k} = \frac{1}{C} \sum_{c = 1}^{C} RMSE_{kc} .

(16)

Furthermore, we combined the $ARMSE_{k}$ into one overall performance measure by taking the mean of the values by domain:

MARMSE = \frac{1}{K} \sum_{k = 1}^{K} ARMSE_{k} .

(17)

Instead of using the $MARMSE$ it is generally better to use a weighted version, for instance a version weighted with the domain size, that is, $\sum_{k = 1}^{K} (N_{k} / N) \times ARMSE_{k}$ . However, in our simulation study the domain sizes are equal in expectation and conclusions based on the weighted $MARMSE$ are (almost) identical to those based on Equation (17).

Finally, the bias was assessed as the mean of absolute bias (MAB):

MAB = \frac{1}{RKC} \sum_{r = 1}^{R} \sum_{k = 1}^{K} \sum_{c = 1}^{C} | Z_{kc} - {\hat{D}}_{kc, r} | .

(18)

In the next sections we evaluate the performance of our approach in terms of MARMSE and MAB.
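
The evaluation measures in Equations (15) to (18) can be computed from the simulation output as in the following Python sketch. The function name, the array layout, and the toy input are hypothetical.

```python
import numpy as np


def evaluation_measures(z_true, d_hat):
    """Compute RMSE, ARMSE, MARMSE, and MAB.

    z_true : (K, C) array of true proportions Z_kc.
    d_hat  : (R, K, C) array of estimates over R simulation rounds.
    """
    err = np.asarray(d_hat, dtype=float) - np.asarray(z_true, dtype=float)
    rmse = np.sqrt((err ** 2).mean(axis=0))   # Eq. (15), shape (K, C)
    armse = rmse.mean(axis=1)                 # Eq. (16), shape (K,)
    marmse = armse.mean()                     # Eq. (17), scalar
    mab = np.abs(err).mean()                  # Eq. (18), scalar
    return rmse, armse, marmse, mab


# Toy input: K = 1 domain, C = 2 categories, R = 2 rounds.
z_true = np.array([[0.5, 0.5]])
d_hat = np.array([[[0.6, 0.4]],
                  [[0.4, 0.6]]])
rmse, armse, marmse, mab = evaluation_measures(z_true, d_hat)
print(rmse, armse, marmse, mab)
```

Note that, as defined in Equation (18), the MAB averages absolute errors per simulation round rather than taking the absolute value of the average error.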

3.3. MARMSE and Bias Compared to Single-Sample Estimators

Table 2 presents the percentage of settings in which the extended combined estimator based on Equation (13) showed a lower MARMSE than the estimator from the probability sample (PS), the estimator from the nonprobability sample (NPS), and both (Both). Each row in Table 2 aggregates over the number of categories (4 settings), the number of domains (4 settings), the size of the probability sample (4 settings), and the size of the nonprobability sample (4 settings), whereas the selectivity and size of categories are kept fixed per row. So, each row summarizes the results over $4 \times 4 \times 4 \times 4 = 256$ settings in total.

Table 2.

Proportion (×100%) of Combined Estimators with Lower MARMSE than Single-Sample Estimators.

Selectivity	Size of categories	PS	NPS	Both
$δ = 0.5$	Equal	97	96	94
	Unequal	92	96	88
$δ = 1.0$	Equal	100	91	91
	Unequal	99	91	90
$δ = 1.5$	Equal	100	82	82
	Unequal	100	82	82

The proportion of times that the combined estimator performed better than an estimator from a nonprobability sample was about 82% in the cases with the weakest selectivity ( $δ = 1.5$ ), rising to about 96% in the cases with the strongest selectivity ( $δ = 0.5$ ). As expected, when a strong selectivity in the nonprobability sample was introduced this resulted in a large bias, and the combined estimator was pulled toward the probability sample by assigning it more weight, yielding a less biased estimate as a result (see the end of this subsection). We also observe that the combined estimator performed better than the estimator from the probability sample in a very large proportion of cases under all scenarios considered here. In the cases with the strongest selectivity ( $δ = 0.5$ ), the proportion of cases where the combined estimator performed better than the probability sample was the smallest (but still over 90%), because the selective nonprobability sample was not always contributing to decreasing the MARMSE. The distinction between equal or unequal category sizes did not have a strong effect on the results with respect to MARMSE.

The reduction in MARMSE achieved by the combined estimator in each scenario can be observed in Table 3. Column 3 contains the average difference between the MARMSE of the combined estimator and that of the probability sample, and column 4 contains the average difference between the MARMSE of the combined estimator and that of the nonprobability sample.

Table 3.

Average Difference Between MARMSE of the Combined Estimator (C) and the Direct Estimators for the Probability (PS) and Nonprobability (NPS) Samples.

Selectivity	Size of categories	Difference C−PS	Difference C−NPS
$δ = 0.5$	Equal	−0.0044	−0.0769
	Unequal	−0.0039	−0.0763
$δ = 1.0$	Equal	−0.0057	−0.0481
	Unequal	−0.0055	−0.0470
$δ = 1.5$	Equal	−0.0070	−0.0256
	Unequal	−0.0069	−0.0242

We observe that the average reduction of MARMSE by using the combined estimator was an order of magnitude larger for the nonprobability sample than for the probability sample. For decreasing levels of selectivity (going from $δ = 0.5$ to $δ = 1.5$ ), the size of the average reduction with respect to the nonprobability sample became smaller, whereas the opposite effect was seen for the average reduction with respect to the probability sample.

Table B.1 in Appendix B displays the MARMSE for our approach for all conditions with equal-size categories and the strongest selectivity ( $δ = 0.5$ ). That table shows that there was no situation in which the MARMSE from the combined estimator was larger than those of both estimators from the two samples. In fact, for independent estimators where at least one of them is unbiased, which is the case here, it is not difficult to show that the MSE (and therefore also the MARMSE) of the combined estimator is always at least as good as the worst of its constituent estimators.
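
This property can be checked numerically with a small Python sketch. It is a simplified illustration that assumes independent estimators, an unbiased probability-sample estimator, and a weight that is treated as fixed rather than estimated from the data; the function names and the variance and bias values are hypothetical.

```python
import numpy as np


def mse_combined(w, var_p, var_np, bias_np):
    """MSE of w * Z_P + (1 - w) * Z_NP for a fixed weight w.

    With independent estimators and an unbiased Z_P, the cross term in
    the expansion of the squared error has expectation zero.
    """
    return w ** 2 * var_p + (1 - w) ** 2 * (var_np + bias_np ** 2)


def never_worse_than_worst(var_p, var_np, bias_np, n_grid=101):
    """Check MSE(D) <= max(MSE(Z_P), MSE(Z_NP)) for all w on a grid in [0, 1]."""
    w = np.linspace(0.0, 1.0, n_grid)
    worst = max(var_p, var_np + bias_np ** 2)
    return bool(np.all(mse_combined(w, var_p, var_np, bias_np) <= worst + 1e-15))


# Hypothetical variance and bias combinations.
checks = [never_worse_than_worst(0.0021, 0.0001, 0.05),
          never_worse_than_worst(0.0105, 0.0001, 0.00),
          never_worse_than_worst(0.0002, 0.0020, 0.10)]
print(checks)
```

The check succeeds because the MSE is bounded by $[W^{2} + (1 - W)^{2}]$ times the larger of the two separate MSEs, and $W^{2} + (1 - W)^{2} \leq 1$ for any weight between zero and one.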

For the scenarios with $δ = 0.5$ it is seen that, when the probability sample size per domain is as small as 20 and the nonprobability sample size is large, it can happen that the combined estimator has a better MARMSE than using only the probability sample, but it does not improve on the estimator from the nonprobability sample. This makes sense because if the probability sample is too small, its contribution is not improving the combined estimator due to its large variance.

For probability sample sizes larger than 20, the combined estimator nearly always improved on both estimators when $K > 2$ . When the estimators were computed for two domains ( $K = 2$ ) and the number of categories of the target variable was at most five ( $C \leq 5$ ), there were some conditions where the estimator from the probability sample had a smaller MARMSE than the combined estimator. This can be explained by the fact that our model for the bias $b_{kc}$ shares information across domains, as can be seen in formulas (10) and (13). Consequently, working with a small number of domains may not provide enough information to significantly improve the estimators.

Tables B.2 and B.3 in Appendix B show similar results for all conditions with equal-size categories and $δ = 1.0$ and $δ = 1.5$ , respectively. The patterns here are similar to what was observed in Table B.1. As $δ$ increased and the nonprobability sample became less selective with respect to the target variable, the number of conditions where the combined estimator did not improve on the nonprobability sample increased, although these conditions mostly occurred when the probability sample size per domain was as small as 20. In addition, for $δ \geq 1$ the MARMSE of the combined estimator was always smaller than the MARMSE of the probability sample.

We have also constructed analogous MARMSE tables for the conditions with unequal category sizes. The patterns in these tables were very similar and did not lead to new insights or different conclusions.

Finally, in terms of bias the results of the simulation study were clear-cut: the combined estimator had a smaller MAB than the estimator based on the nonprobability sample in 100% of the conditions examined here.

3.4. Performance of the Combined Estimator Related to the Average Weight

An interesting question for applications is whether situations in which the combined estimator is likely to outperform both individual estimators could be recognized in practice from the available data. At first sight it may seem natural to use the EMSEs of the proposed estimators as indicators of their quality. We can indeed use the EMSEs for this purpose, but only to compare estimators under the same model. We cannot use the EMSEs to compare the performance of estimators computed under different models. For that comparison we need an indicator that is more directly related to the computed estimates.

From the MARMSE results in this simulation study, we have found some evidence that the average value of the weight of the probability sample, $\bar{W} = \sum_{k = 1}^{K} \sum_{c = 1}^{C} W_{kc} / KC$ , could be useful as a diagnostic for this purpose.

Figure 2 below shows a scatter plot of the simulation results for all scenarios. The horizontal axis shows the average weight $\bar{W}$ and the vertical axis shows the MARMSE of the combined estimator; each point represents one of the conditions in the simulation study. The shade of each point shows which estimator had the smallest MARMSE for that condition.

Figure 2.

Relation between weight and MARMSE of combined estimator.

In Figure 2, the following pattern is seen. When $\bar{W} \geq 0.75$ , the combined estimator nearly always had the smallest MARMSE of the three estimators. When $\bar{W} \leq 0.65$ , the estimator based on the nonprobability sample nearly always performed best. When $0.65 < \bar{W} < 0.75$ , it depended on the specific scenario which of the three estimators performed best. This suggests that for $K \geq 2$ domains, the combined estimator could be used with some confidence whenever $\bar{W} \geq 0.75$ and it should be avoided whenever $\bar{W} \leq 0.65$ .
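The rule of thumb above can be written as a small helper. The following Python sketch is illustrative only (the authors worked in R, and the function names and the example weights are hypothetical). It computes $\bar{W}$ as the plain mean of a $K \times C$ matrix of weights $W_{kc}$ and applies the thresholds 0.75 and 0.65 read from Figure 2.

```python
import numpy as np

def average_weight(W):
    """Mean of the K x C matrix of probability-sample weights W_kc."""
    return float(np.asarray(W, dtype=float).mean())

def weight_diagnostic(W, upper=0.75, lower=0.65):
    """Classify a set of weights using the thresholds suggested by Figure 2."""
    w_bar = average_weight(W)
    if w_bar >= upper:
        return w_bar, "combined estimator likely to perform best"
    if w_bar <= lower:
        return w_bar, "combined estimator not recommended"
    return w_bar, "inconclusive"

# Hypothetical weights for K = 2 domains and C = 3 categories.
W_example = np.array([[0.80, 0.90, 0.70],
                      [0.85, 0.75, 0.80]])
w_bar, verdict = weight_diagnostic(W_example)
print(round(w_bar, 3), verdict)
```

The thresholds are empirical findings of this particular simulation study, so they are exposed as parameters rather than hard-coded.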

3.5. MARMSE and Bias Compared to Other Combined Estimators

Finally, we compared the proposed combined estimator to two other estimators that combine information from the two samples using only aggregated data: the estimator from M. N. Elliott and Haviland (2007; EH) and an estimator based on iterative proportional fitting (IPF).

The EH estimator is also based on Equation (3). Although M. N. Elliott and Haviland (2007) largely assumed that the bias of the estimator based on the nonprobability sample is known, they did suggest a simple estimator for this bias, namely ${\hat{b}}_{kc} = {\hat{Z}}_{kc}^{(NP)} - {\hat{Z}}_{kc}^{(P)}$ (see p. 214 in M. N. Elliott and Haviland 2007), and used this to estimate $MSE ({\hat{Z}}_{kc}^{(NP)})$ by $\hat{MSE} ({\hat{Z}}_{kc}^{(NP)}) = {\hat{Z}}_{kc}^{(NP)} (1 - {\hat{Z}}_{kc}^{(NP)}) / n_{k}^{(NP)} + ({\hat{Z}}_{kc}^{(NP)} - {\hat{Z}}_{kc}^{(P)})^{2}$ . They estimated $MSE ({\hat{Z}}_{kc}^{(P)})$ by $\hat{MSE} ({\hat{Z}}_{kc}^{(P)}) = {\hat{Z}}_{kc}^{(P)} (1 - {\hat{Z}}_{kc}^{(P)}) / n_{k}^{(P)}$ , and then used Equation (2) to estimate $W_{kc}$ .
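The EH computation described above can be sketched in Python as follows. This is an illustration, not the code used in the study. Equations (2) and (3) are not reproduced in this section, so the sketch assumes the usual forms for independent estimators: the weight of the probability sample is $MSE_{NP} / (MSE_{P} + MSE_{NP})$ and the combined estimate is the corresponding weighted mean. The function name and the example inputs are hypothetical.

```python
import numpy as np

def eh_combined(z_p, z_np, n_p, n_np):
    """EH-style weights and combined estimates for K x C arrays of proportions.

    z_p, z_np : estimated proportions from the probability / nonprobability sample
    n_p, n_np : per-domain sample sizes (length K)
    """
    z_p = np.asarray(z_p, dtype=float)
    z_np = np.asarray(z_np, dtype=float)
    n_p = np.asarray(n_p, dtype=float).reshape(-1, 1)
    n_np = np.asarray(n_np, dtype=float).reshape(-1, 1)

    bias = z_np - z_p                                 # simple bias estimate
    mse_p = z_p * (1.0 - z_p) / n_p                   # variance only (P assumed unbiased)
    mse_np = z_np * (1.0 - z_np) / n_np + bias ** 2   # variance plus squared bias

    total = mse_p + mse_np
    # ASSUMED form of Equation (2): weight of the probability sample.
    # If both MSE estimates are zero, fall back to equal weights.
    W = np.divide(mse_np, total, out=np.full_like(total, 0.5), where=total > 0)
    combined = W * z_p + (1.0 - W) * z_np             # ASSUMED form of Equation (3)
    return W, combined

# Hypothetical inputs: one domain, three categories.
W, z_c = eh_combined([[0.30, 0.45, 0.25]], [[0.20, 0.50, 0.30]], [200], [5000])
print(np.round(W, 3), np.round(z_c, 3))
```

Because the weights differ per cell, the combined proportions within a domain need not sum exactly to one in this sketch.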

In the IPF approach we construct a two-dimensional table where the initial internal values are given by ${\hat{Z}}_{kc}^{(NP)}$ and the fixed marginal totals by ${\hat{Z}}_{+ c}^{(P)} = \sum_{k = 1}^{K} {\hat{Z}}_{kc}^{(P)}$ and ${\hat{Z}}_{k +}^{(P)} = \sum_{c = 1}^{C} {\hat{Z}}_{kc}^{(P)} = 1$ . Next, IPF is applied to align the internal cell values with the fixed marginal totals.
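A minimal Python sketch of this IPF step is given below, assuming strictly positive starting values and alternating row and column scaling until the row totals are matched. The function name, tolerance, and example proportions are hypothetical and not taken from the study.

```python
import numpy as np

def ipf(seed, row_totals, col_totals, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting of a K x C table to fixed margins."""
    table = np.asarray(seed, dtype=float).copy()
    row_totals = np.asarray(row_totals, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    for _ in range(max_iter):
        table *= (row_totals / table.sum(axis=1))[:, None]  # match row totals
        table *= (col_totals / table.sum(axis=0))[None, :]  # match column totals
        if np.allclose(table.sum(axis=1), row_totals, rtol=0.0, atol=tol):
            break
    return table

# Hypothetical proportions for K = 2 domains and C = 3 categories.
z_np = np.array([[0.20, 0.50, 0.30],
                 [0.25, 0.45, 0.30]])   # initial internal values (NP sample)
z_p = np.array([[0.30, 0.45, 0.25],
                [0.35, 0.40, 0.25]])    # source of the fixed margins (P sample)

fitted = ipf(z_np, row_totals=np.ones(2), col_totals=z_p.sum(axis=0))
print(np.round(fitted, 4))
```

The row totals equal one per domain and the column totals sum to $K$, so the two sets of margins are consistent and the procedure can converge.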

Tables 4 and 5 show the proportion of times that the proposed combined estimator outperformed one or both of these alternative estimators in terms of MARMSE (Table 4) and MAB (Table 5), in a similar format to Table 2. It is seen that the combined estimator nearly always had a smaller MARMSE and usually had a smaller MAB than the EH estimator. The combined estimator also outperformed the IPF estimator in terms of bias most of the time, in particular when the level of selectivity of the nonprobability sample became smaller (rising from $δ = 0.5$ to $δ = 1.5$ ). In terms of MARMSE, the combined estimator and IPF each outperformed the other in about half of the conditions examined here.

Table 4.

Proportion (×100%) of Combined Estimators with Lower MARMSE Than Other Estimators Using Both Samples.

Selectivity	Size of categories	EH	IPF	Both
$δ = 0.5$	Equal	100	44	44
	Unequal	100	48	48
$δ = 1.0$	Equal	95	52	47
	Unequal	100	54	54
$δ = 1.5$	Equal	100	59	59
	Unequal	96	58	54

Table 5.

Proportion (×100%) of Combined Estimators with Lower MAB Than Other Estimators Using Both Samples.

Selectivity	Size of categories	EH	IPF	Both
$δ = 0.5$	Equal	88	63	57
	Unequal	91	61	55
$δ = 1.0$	Equal	85	75	70
	Unequal	91	74	70
$δ = 1.5$	Equal	67	88	59
	Unequal	85	87	75

4. Application to Real Data

4.1. Data

The Educational Attainment File (EAF) provided by CBS combines data from the Labour Force Survey and various administrative data sources on the educational level of people. Here, we consider the EAF of the year 2019. The target population consists of all the people who were registered in the Municipal Personal Records Database (the official population register of the Netherlands) on the 1st of October 2019.

The Labour Force Survey (LFS) is a rotating panel survey with five waves, with a target population of people fifteen years or older who live in the Netherlands. The EAF contains LFS data from several years, which have been integrated with other available information from administrative data sources (Linder et al. 2014).

The administrative data sources are nonprobability samples including the following: files with educational histories as submitted by job seekers at the UWV Work company, Central Register of Registrations in Higher Education, Education number files from secondary education, and Education number files from primary education and special education. These records also contain measurements of educational attainment and comprise 11,092,584 observations (i.e., about 64% of the target population of the EAF). It is known that, for historical reasons, these administrative data are selective with respect to age, migration background, and educational attainment itself (older people, people not born in the Netherlands, and people with lower attained education levels are more likely to be missing); see Linder et al. (2014) for a detailed analysis of missing data.

For this study, the administrative records in the EAF are treated as a nonprobability sample and ${\hat{Z}}_{kc}^{(NP)}$ is the estimate based only on these records. As a probability sample for the estimate ${\hat{Z}}_{kc}^{(P)}$ , we use a separate dataset consisting of LFS data from 2016. We focus on the target population of persons fifteen years or older. The variables used to create the domains and categories are Municipality and Highest-attained level of education respectively. The variable Municipality has 355 domains.

We selected eleven municipalities of different sizes and used those as our domains. In the Netherlands, Municipality is quite unrelated to Education Level, so we are in a situation that is suitable for our method (see also Subsection 2). After selection of the eleven municipalities, we ended up with 17,193 observations from the probability sample and 623,114 from the nonprobability sample. Highest-attained level of education originally has eighteen categories of unequal size per domain, which we have recoded into three categories, namely lower (1), middle (2), and higher education (3).

For its regular output based on the EAF, CBS computes estimates by weighting the LFS data to represent the subset of the population that is not covered by the administrative records (Linder et al. 2014). The quality of these regular estimates is considered to be very high (see Kuijvenhoven and Scholtus 2010; Linder et al. 2014). We therefore consider these estimates to be the true values and compare our proposed combined estimator based on Equation (10) to these values.

4.2. Results

The situation at hand concerns a target variable with three categories, eleven domains, and mostly quite large sample sizes per domain. The regular CBS estimates, the estimates from the probability and nonprobability samples, and the combined estimates are given in Table 6. In Table 6, n(P) and n(NP) denote the sample sizes of the probability sample and the nonprobability sample, respectively. The number of inhabitants in 2019 is given in parentheses after the name of each municipality.

Table 6.

Estimates of Educational Attainment per Municipality; All Estimates Presented as Rounded Percentages.

Municipality	n(P)	n(NP)	CBS estimates			P sample			NP sample			Combined estim.
Municipality	n(P)	n(NP)	1	2	3	1	2	3	1	2	3	1	2	3
Amsterdam (862,965)	11,051	473,324	24	29	46	23	29	49	23	30	46	23	29	49
Amstelveen (94,418)	1,365	37,815	23	33	44	20	28	52	20	34	45	20	28	52
Krimpenerwaard (57,700)	1,128	23,925	34	43	23	35	42	23	29	44	27	35	42	24
Medemblik (46,031)	845	19,961	35	45	20	35	41	24	34	43	23	35	41	24
Teylingen (38,510)	684	16,990	26	42	33	22	41	37	25	39	37	23	41	37
Boxtel (33,748)	683	14,174	34	40	26	37	39	24	31	40	30	36	39	25
Wassenaar (27,093)	336	10,177	26	33	41	23	52	25	25	33	42	23	50	26
Sluis (23,243)	448	8,822	35	46	19	41	41	18	32	45	23	40	41	18
West Maas en Waal (20,065)	329	8,900	33	46	21	45	40	16	30	45	26	43	40	17
Ouder-Amstel (14,276)	264	6,552	23	34	43	28	30	42	21	33	45	27	30	43
Terschelling (4,928)	60	2,474	27	51	22	21	39	39	25	49	25	23	44	34

We observe that estimates of the combined estimator are, overall, close to the regular CBS estimates, which we consider the true values. Large absolute differences (10 percentage points or more) are observed in four out of the thirty-three estimates in total. This is the case for categories 2 and 3 in Wassenaar, category 1 in West Maas en Waal, and category 3 in Terschelling. The estimates are not particularly affected by the sample size: of these three municipalities only Terschelling has a sample size that we would consider small ( $n = 60$ ), and two other municipalities in the table have sample sizes similar to those of Wassenaar and West Maas en Waal without showing such differences. In Table 6 we see that the biggest differences in the combined estimator occur when the estimates from the probability sample are also very different from the regular CBS estimates. This means that in these situations, the combined estimator is being computed with a very high weight for the probability sample because the method is estimating a large bias in the nonprobability sample.
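The count of large differences can be reproduced directly from the rounded percentages in Table 6. The Python sketch below is only an arithmetic check on the published, rounded values (the variable names are ours). Differences of unrounded estimates could deviate slightly.

```python
import numpy as np

municipalities = ["Amsterdam", "Amstelveen", "Krimpenerwaard", "Medemblik",
                  "Teylingen", "Boxtel", "Wassenaar", "Sluis",
                  "West Maas en Waal", "Ouder-Amstel", "Terschelling"]

# Rounded percentages from Table 6, columns = categories 1, 2, 3.
cbs = np.array([[24, 29, 46], [23, 33, 44], [34, 43, 23], [35, 45, 20],
                [26, 42, 33], [34, 40, 26], [26, 33, 41], [35, 46, 19],
                [33, 46, 21], [23, 34, 43], [27, 51, 22]])
combined = np.array([[23, 29, 49], [20, 28, 52], [35, 42, 24], [35, 41, 24],
                     [23, 41, 37], [36, 39, 25], [23, 50, 26], [40, 41, 18],
                     [43, 40, 17], [27, 30, 43], [23, 44, 34]])

abs_diff = np.abs(combined - cbs)

# List every (municipality, category, difference) with a gap of 10 points or more.
rows, cols = np.where(abs_diff >= 10)
large = [(municipalities[int(k)], int(c) + 1, int(abs_diff[k, c]))
         for k, c in zip(rows, cols)]

print(f"{len(large)} of {abs_diff.size} estimates differ by 10 points or more")
for entry in large:
    print(entry)
```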

The differences between the estimates from the probability sample and the regular CBS estimates can be explained also by three other aspects: first by the different procedures of estimation, second, by differences in the data processing and definitions used, and third, because our target population is from 2019 and we are using a probability sample from 2016, although this still is a reasonable approximation because the distribution of educational attainment changes slowly over the years.

Because these three aspects could have caused the combined estimator to deviate from the CBS estimates, we have also simulated two scenarios in which these three aspects are eliminated. For the simulation we first draw two new simple random samples directly from the observed distribution in the EAF: one with the same sample size as the probability sample used above (denoted by n(1) in Table 7) and one with a smaller sample size (denoted by n(2) in Table 7). Then, we calculate the combined estimator again with each of the new samples; the results are shown in Table 7 with the respective sample sizes. For the numbers of inhabitants per municipality in 2019, see Table 6.

Table 7.

Combined Estimator with Simulated Probability Samples of Educational Attainment per Municipality; All Estimates Presented as Rounded Percentages.

Municipality	n(1)	n(2)	CBS estimates			Comb. estimator, same sample size (1)			Comb. estimator, smaller sample size (2)
Municipality	n(1)	n(2)	1	2	3	1	2	3	1	2	3
Amsterdam	11,051	1,105	24	29	46	24	29	46	26	26	48
Amstelveen	1,365	136	23	33	44	24	33	43	20	32	48
Krimpenerwaard	1,128	112	34	43	23	34	42	23	38	40	22
Medemblik	845	84	35	45	20	35	42	23	39	44	16
Teylingen	684	68	26	42	33	26	43	31	23	40	37
Boxtel	683	68	34	40	26	31	44	24	28	44	27
Wassenaar	336	33	26	33	41	26	33	41	31	26	44
Sluis	448	44	35	46	19	34	48	17	32	51	16
West Maas en Waal	329	32	33	46	21	30	44	26	29	44	27
Ouder-Amstel	264	26	23	34	43	21	34	46	24	31	46
Terschelling	60	6	27	51	22	26	49	25	24	51	24

In Table 7 we observe that the difference between the regular CBS estimates and the combined estimates computed with a probability sample of the same size as in Table 6 is drastically reduced, with no absolute difference larger than 5 percentage points. Even for the smaller sample size n(2), reduced to a tenth of the original size n(1), the estimates are still very close: the largest absolute difference is now 7 percentage points, and only three estimates have an absolute difference larger than 5 percentage points.
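These summary figures follow from the rounded percentages in Table 7, as the short Python check below illustrates. It uses only the published, rounded values, and the variable names are ours.

```python
import numpy as np

# Rounded percentages from Table 7, rows in table order, columns = categories 1, 2, 3.
cbs = np.array([[24, 29, 46], [23, 33, 44], [34, 43, 23], [35, 45, 20],
                [26, 42, 33], [34, 40, 26], [26, 33, 41], [35, 46, 19],
                [33, 46, 21], [23, 34, 43], [27, 51, 22]])
comb_same = np.array([[24, 29, 46], [24, 33, 43], [34, 42, 23], [35, 42, 23],
                      [26, 43, 31], [31, 44, 24], [26, 33, 41], [34, 48, 17],
                      [30, 44, 26], [21, 34, 46], [26, 49, 25]])
comb_small = np.array([[26, 26, 48], [20, 32, 48], [38, 40, 22], [39, 44, 16],
                       [23, 40, 37], [28, 44, 27], [31, 26, 44], [32, 51, 16],
                       [29, 44, 27], [24, 31, 46], [24, 51, 24]])

diff_same = np.abs(comb_same - cbs)     # sample size n(1)
diff_small = np.abs(comb_small - cbs)   # sample size n(2)

max_same = int(diff_same.max())
max_small = int(diff_small.max())
n_over_5_small = int((diff_small > 5).sum())

print("n(1): largest absolute difference =", max_same)
print("n(2): largest absolute difference =", max_small)
print("n(2): number of differences above 5 points =", n_over_5_small)
```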

5. Discussion

In this study, we have proposed and evaluated a way to combine a probability and a nonprobability sample with the aim of reducing the mean squared error of an estimator of proportions. We proposed a model to estimate the bias of the direct estimator for a nonprobability sample and combine this estimator with the direct estimator for a probability sample, aiming to obtain a combined estimator with a smaller mean squared error.

Through simulation studies, we have shown that the combined estimator can lead to a reduction of the MSE, which we evaluated through the MARMSE of each simulated scenario. This evaluation indicates that using our model for bias $b_{kc}$ to compute the weights of the combined estimator is better than not doing it when we have a small probability sample and a not too biased estimator for the nonprobability sample, which is concordant with our initial expectations. Finally, we observed that the average weight $\bar{W}$ of the probability sample in the combined estimator could be useful as a diagnostic to judge the performance of our method in practice, with $\bar{W} \geq 0.75$ indicating near-certain improvement compared to both single-sample estimators in our simulation study.

These findings are supported by an empirical application to a real dataset on educational attainment and a simulation based on that dataset using regular CBS estimates as the true values. Here, we have shown that the combined estimator can obtain close estimates to those obtained with methods that have been tested and validated previously and are currently in use.

As already mentioned in the Introduction, our proposed method has the advantage that only aggregated data are required. Furthermore, our method is robust in the sense that it will never yield a worse result, that is, higher MSE, than the highest MSE that was obtained from one of the two samples separately. Finally, the proposed method is very easy to implement in software like R since it requires very few lines of code.

We see various possibilities for extending our method to a wider range of situations. First, in this article we examined a simple way to include available auxiliary data in the estimation procedure, namely by estimating $β_{c}$ by means of Equation (13) instead of Equation (10). Our approach can be extended to include auxiliary information in several other ways. We plan to explore some of those approaches in future research.

Second, in this article, we proposed approaches for categorical variables. These approaches could be extended to other types of variables, particularly if they are intended to be used beyond the field of official statistics. For instance, if several continuous variables are of interest and one wants to publish their totals or means per domain, this would require a different model to address the bias in the nonprobability sample.

Third, an important assumption in this study was that the probability sample is unbiased. However, the problems that probability samples have been facing lately, as pointed out in the Introduction, could mean that this assumption does not hold. In that case, the proposed approaches would need to be adapted.

Fourth, for simplicity we developed the combined estimator for an elementary setup, where the probability sample is a simple random sample stratified by domain. In a broader context, ${\hat{Z}}_{kc}^{(P)}$ could be a ratio estimator under a complex sampling design and its variance would have to be computed by a more sophisticated formula than the one provided here. However, the basic idea behind the combined estimator could then still be applied in the same way.

Fifth, in this article we have assumed that ${\hat{Z}}_{kc}^{(NP)}$ and ${\hat{Z}}_{kc}^{(P)}$ are independent estimators. When the (expected) covariance between ${\hat{Z}}_{kc}^{(NP)}$ and ${\hat{Z}}_{kc}^{(P)}$ can be estimated, it may be possible to omit this assumption. We leave this extension of our paper to future work.

Finally, if one has data at the unit level available, one could in some situations consider using a method designed for that situation to remove part of the selectivity in the nonprobability sample, and then apply our proposed approach on aggregated estimates from the nonprobability sample and a probability sample to mitigate the remaining selectivity.

Footnotes

Appendix A

Appendix B

Acknowledgements

The authors thank a guest editor and three referees for their valuable comments.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Ton de Waal

Data Availability Statement

The research repository of this study can be found in the following GitHub repository: . The data from Subsection 4 are available in a secure environment at CBS. To get access to the data one needs to contact CBS. The used software was R version 4.2.3 (Subsection 3) and R version 4.1.3 (Subsection 4).

Received: June 2023

Accepted: September 2024

References

Baker; Brick, J. M.; Bates, N. A.; Battaglia; Couper, M. P.; Dever, J. A.; Gile, K. J.; Tourangeau. 2013. “Summary Report of the AAPOR Task Force on Nonprobability Sampling.” Journal of Survey Statistics and Methodology 1: 90–143. DOI: https://doi.org/10.1093/jssam/smt008.

Brick, J. M. 2014. “Explorations in Nonprobability Sampling Using the Web.” In Proceedings of Statistics Canada Symposium 2014. Beyond Traditional Survey Taking: Adapting to a Changing World. https://www.statcan.gc.ca/eng/conferences/symposium2014/program/14252-eng.pdf (accessed September 12, 2024).

Chen. 2020. “Doubly Robust Inference with Nonprobability Survey Samples.” Journal of the American Statistical Association 115: 2011–2021. DOI: https://doi.org/10.1080/01621459.2019.1677241.

Cornesse; Blom, A. G.; Dutwin; Krosnick, J. A.; de Leeuw, E. D.; Legleye; Pasek; et al. 2020. “A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research.” Journal of Survey Statistics and Methodology 8: 4–36. DOI: https://doi.org/10.1093/jssam/smz041.

Elliott, M. N.; Haviland. 2007. “Use of a Web-Based Convenience Sample to Supplement a Probability Sample.” Survey Methodology 33: 211–5. https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2007002/article/10498-eng.pdf?st=YjVro-2D (accessed September 12, 2024).

Elliott, M. R. 2009. “Combining Data from Probability and Non-Probability Samples Using Pseudo-Weights.” Survey Practice 2: 1–9. DOI: https://doi.org/10.29115/SP-2009-0025.

Elliott, M. R.; Valliant. 2017. “Inference for Nonprobability Samples.” Statistical Science 32: 249–64. DOI: https://doi.org/10.1214/16-STS598.

Kim, J. K.; Tam, S. M. 2021. “Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference.” International Statistical Review 88: 382–401. DOI: https://doi.org/10.1111/insr.12434.

Kuijvenhoven; Scholtus. 2010. “Estimating Accuracy for Statistics Based on Register and Survey Data.” Discussion Paper, Statistics Netherlands. https://www.cbs.nl/nl-nl/achtergrond/2010/11/estimating-accuracy-for-statistics-based-on-register-and-survey-data (accessed September 12, 2024).

Lee. 2006. “Propensity Score Adjustment as a Weighting Scheme for Volunteer Panel Web Surveys.” Journal of Official Statistics 22: 329–49. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/propensity-score-adjustment-as-a-weighting-scheme-for-volunteer-panel-web-surveys.pdf (accessed September 12, 2024).

Lee; Valliant. 2009. “Estimation for Volunteer Panel Web Surveys Using Propensity Score Adjustment and Calibration Adjustment.” Sociological Methods & Research 37: 319–43. DOI: https://doi.org/10.1177/00491241083296.

Linder; van Roon; Bakker, B. F. M. 2014. “Combining Data from Administrative Sources and Sample Surveys; the Single-Variable Case. Case Study: Educational Attainment.” Report for Work Package 4.2 of the ESSnet Project Data Integration. https://wayback.archive-it.org/12090/20181005023330/https://ec.europa.eu/eurostat/cros/system/files/WP4.pdf (accessed September 12, 2024).

Liu, A.-C.; Scholtus; De Waal. 2023. “Correcting Selection Bias in Big Data by Pseudo-Weighting.” Journal of Survey Statistics and Methodology 11: 1181–203. DOI: https://doi.org/10.1093/jssam/smac029.

Liu; Zheng; Pan. 2023. “Doubly Robust Estimation for Non-Probability Samples with Modified IPAD.” Statistical Analysis and Data Mining 16: 224–36. DOI: https://doi.org/10.1002/sam.11614.

Lohr, S. L. 2019. Sampling: Design and Analysis. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC.

Meng, X.-L. 2018. “Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” Annals of Applied Statistics 12: 685–726. DOI: https://doi.org/10.1214/18-AOAS1161SF.

Pannekoek. 2014. “Prorating (Method).” In Handbook on Methodology of Modern Business Statistics. Eurostat. https://cros.ec.europa.eu/group/31/files/2254/download (accessed September 12, 2024).

Pannekoek; De Waal. 1998. “Synthetic and Combined Estimators in Statistical Disclosure Control.” Journal of Official Statistics 14: 399–410. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/synthetic-and-combined-estimators-in-statistical-disclosure-control.pdf (accessed September 12, 2024).

Pfeffermann. 2002. “Small Area Estimation – New Developments and Directions.” International Statistical Review 70: 125–43. DOI: https://doi.org/10.1111/j.1751-5823.2002.tb00352.x.

Pfeffermann. 2013. “New Important Developments in Small Area Estimation.” Statistical Science 28: 40–68. DOI: https://doi.org/10.1214/12-STS395.

Rao, J. N. K. 2003. Small Area Estimation. New York, NY: Wiley.

Rao, J. N. K. 2021. “On Making Valid Inferences by Integrating Data from Surveys and Other Sources.” Sankhya B 83: 242–72. DOI: https://doi.org/10.1007/s13571-020-00227-w.

Rivers. 2007. “Sampling for Web Surveys.” Presented at the 2007 Joint Statistical Meetings, Salt Lake City, UT, USA. https://static.texastribune.org/media/documents/Rivers_matching4.pdf (accessed September 12, 2024).

Rivers; Bailey. 2009. “Inference from Matched Samples in the 2008 US National Elections.” In JSM Proceedings, Survey Research Methods Section, 627–39. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/y2009f.html (accessed September 12, 2024).

Sakshaug, J. W.; Wiśniowski; Perez Ruiz, D. A.; Blom, A. G. 2019. “Supplementing Small Probability Samples with Nonprobability Samples: A Bayesian Approach.” Journal of Official Statistics 40: 653–81. DOI: https://doi.org/10.2478/jos-2019-00.

Särndal, C.-E.; Swensson; Wretman, J. H. 1992. Model Assisted Survey Sampling. New York, NY: Springer.

Smit, V. I. C. 2021. “Correcting Selectivity in Datasets with Pseudo-Weights: A Simulation Study.” Master thesis, Leiden University, The Netherlands. DOI: https://doi.org/10.13140/RG.2.2.28253.74726.

Valliant. 2020. “Comparing Alternatives for Estimation from Nonprobability Samples.” Journal of Survey Statistics and Methodology 8: 231–63. DOI: https://doi.org/10.1093/jssam/smz003.

Valliant; Dever, J. A. 2011. “Estimating Propensity Adjustments for Volunteer Web Surveys.” Sociological Methods & Research 40: 105–37. DOI: https://doi.org/10.1177/0049124110392533.

Van den Brakel. 2019. “New Data Sources and Inference Methods for Statistics.” Discussion Paper, Statistics Netherlands. https://www.cbs.nl/en-gb/background/2019/27/new-data-sources-and-inference-methods-for-statistics (accessed September 12, 2024).

Wiśniowski; Sakshaug, J. W.; Perez Ruiz, D. A.; Blom, A. G. 2020. “Integrating Probability and Nonprobability Samples for Survey Inference.” Journal of Survey Statistics and Methodology 8: 120–47. DOI: https://doi.org/10.1093/jssam/smz051.

Yang; Kim, J. K.; Song. 2020. “Doubly Robust Inference When Combining Probability and Non-Probability Samples with High Dimensional Data.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82: 445–65. DOI: https://doi.org/10.1111/rssb.12354.

Combining Probability and Nonprobability Samples on an Aggregated Level
