Abstract

There is a growing trend among statistical agencies to explore non-probability data sources for producing more timely and detailed statistics, while reducing costs and respondent burden. Coverage and measurement error are two issues that may be present in such data. The imperfections may be corrected using available information relating to the population of interest, such as a census or a reference probability sample. In this paper, we compare a wide range of existing methods for producing population estimates using a non-probability dataset through a simulation study based on a realistic business population. The study was conducted to examine the performance of the methods under different missingness and data quality assumptions. The results confirm the ability of the methods examined to address selection bias. When no measurement error is present in the non-probability dataset, a screening dual-frame approach for the probability sample tends to yield lower sample size and mean squared error results. The presence of measurement error and/or nonignorable missingness increases mean squared errors for estimators that depend heavily on the non-probability data. In this case, the best approach tends to be to fall back to a model-assisted estimator based on the probability sample.

Keywords

estimation, selection bias, big data, cut-off sampling, dual frame

1. Introduction

Probability survey samples have been, for the best part of the last century, the preferred data collection tool of statistical agencies for producing population estimates. Neyman (1934) laid the groundwork for design-based probability sampling theory, and the theory for estimating finite population quantities using probability survey samples is now well-established. Compared with censuses, probability samples provide less costly, more timely, and more detailed information on populations of interest. Methods have been developed to take advantage of additional information from other sources such as administrative datasets to improve the efficiency of estimates from probability samples (see, e.g., Särndal et al. 1992).

In today’s environment, the same reasons that drove the shift from censuses to probability samples are pushing statistical agencies to move away from probability samples to explore alternative data sources (Beaumont 2020). Running a high-quality probability sample is costly, users have a desire for more timely statistics at greater levels of detail, and there is growing expectation for statistical agencies to reduce respondent burden. For example, the Australian Bureau of Statistics (ABS 2022b) priorities for 2022 to 2023 include increasing the use of non-survey data sources to produce new, timely statistics and reduce burden on small and medium enterprises. Statistics Canada (2018) has launched a modernization initiative with five pillars, one of which involves the use of new data sources, lower respondent burden, greater reliance on data integration and modeling and the reduced role of surveys.

These new data sources may include big data or “found data” such as sensor data, satellite imagery and transactions data, and non-probability survey samples such as online web panels. In this paper we apply the term non-probability dataset to any dataset where only part of the population is included, and the probability of inclusion into the dataset is unknown.

The potential benefits of non-probability data for producing statistics have been recognized, for example in Tam and Clarke (2015), Rao and Fuller (2017), and Tillé et al. (2022). In particular, these data promise to address the weaknesses of probability survey samples:

Data collection can be much cheaper since the data is already captured through an existing process (generally for some purpose independent of producing statistics)

Outputs are more timely

Utilizing existing data reduces the need for agencies to ask people or businesses for the same or similar information

Unfortunately, non-probability data sources may suffer from quality issues. Two types of error pervading these data are coverage error (also known as selection bias) and measurement error. Coverage error occurs when we do not have a one-to-one correspondence between the population of interest and the population sampled. It may include: undercoverage, where some units in the target population have been excluded from the sampling population (e.g., new business startups not yet included on a business register at the time of a survey); overcoverage, where the sampling population includes units that are not in the target population (e.g., inactive or closed businesses); and duplication, when some units in the sampling population can be selected more than once (e.g., when a merger leads to the same business being included twice on a survey frame).

Our focus in this paper is on undercoverage error. Meng (2018) showed that undercoverage in a non-probability sample can have a significant impact on the quality of an estimate. In fact, the error can get bigger the larger the dataset is, such that it may be preferable to use a small probability sample rather than a large dataset containing selection bias.

We will also refer to measurement error broadly as a misalignment between the true value of a target concept, and the value actually being captured by the sample. The measurement error may thus arise due to issues such as differences in terms of how concepts are defined, recording or instrument errors, and so on. For example, a business may report their revenue in whole dollars on a survey form which is asking for reporting to be done in thousands of dollars.

In recent years, methods have been developed to address the shortcomings of these new data sources, and in particular selection bias in the non-probability sample. These methods enable the statistician to make direct use of the non-probability data to produce statistics. Generally these methods require us to have additional information from the population that we can use. The auxiliary information is typically used to correct for coverage error or to form models to impute for the missing units in the population, and could include known or estimated population totals, unit level information from a population register or frame, or from an independent probability survey sample (which we will refer to as a reference sample) taken from the same population.

In this paper, we examine and compare through a realistic and wide-ranging simulation study the effectiveness of a cross-section of the estimation methods that have been developed for non-probability data, some in combination with probability sample data. The comparison is undertaken in a business survey context. Business surveys generally have a number of characteristics that will influence the choice and performance of an estimation approach; see Hidiroglou and Lavallée (2009) for a good overview of these characteristics. For instance, there tends to be a larger quantity of auxiliary information available on a business population frame, including items that are well-correlated with the data items of interest, for example through administrative data from taxation agencies. Business data items often also have skewed distributions, while their sample designs tend to be highly stratified, with large variations in the probabilities of selection. In particular, the largest contributors tend to be included with a probability of 1. Finally, there is often a reliable identifier (such as an official business number from the tax system) available, allowing different data sources to be linked. These characteristics will influence the data available from any reference sample from a business population.

We focus on the situation where the non-probability dataset makes up a large (approximately 50%) portion of the population, an accompanying business probability sample from the population is available as auxiliary data, and the non-probability dataset can be linked to units on the probability sample and the population frame. The study explores the effects of nonignorable non-response and measurement error in the non-probability sample on the performance of the various estimators, with the probability sample used to help address these issues. To our knowledge, an empirical comparison of such a wide range of methods in a realistic business survey context has not been done before. The results from the study provide some practical insights for statistical agencies looking to use these estimators to produce inference using their own large business non-probability datasets.

The rest of the paper is structured as follows. The basic setup for this paper is outlined in Section 2. In Section 3, we provide a brief overview of the various estimation approaches that have been developed in recent years in the non-probability sample space. Section 4 discusses a number of sample design frameworks that may be used to produce reference samples to address the shortcomings of the non-probability dataset. In Section 5 we provide a description of the empirical comparison study using simulated business data and discuss the results. The simulation is grounded in data analysis of ABS Business Longitudinal Analysis Data Environment (BLADE, 2020) data and includes empirically-derived right-skewed distributions and realistic missingness scenarios in the non-probability data. We conclude with some final thoughts in Section 6.

2. Basic Setup

For each unit $i$ in a finite population $U$ of size $N$ , we have values ${(x_{i}, y_{i})}^{T}$ for a variable of interest $y$ and some additional auxiliary data items $x$ . For the purpose of this paper we are interested in estimating the population total $Y = \sum_{i = 1}^{N} y_{i}$ , although we note it is also often of interest to estimate the mean $\bar{Y} = N^{- 1} \sum_{i = 1}^{N} y_{i}$ .

Suppose we have a probability sample $A$ (the “reference” sample) of size $n_{A}$ drawn from the population. $A$ may have been collected for a different purpose than we are interested in, and so may contain information about $x$ only. In other cases $A$ may include both $y$ and $x$ . The data collected in $A$ are obtained without error. Define $π_{i}^{A} = P (i \in A | U)$ as the inclusion probability for unit $i$ being in the probability sample, and $d_{i}^{A} = 1 / π_{i}^{A}$ is the design weight for $i \in A$ . The $π_{i}^{A}$ values are known from the sample design.

Denote a non-probability dataset by $B$ . Like the probability sample, $B$ may contain information on $x$ only, for example if $B$ is an administrative dataset collecting general population information. Alternatively $B$ may include both $x$ and $y$ , for example if it is a web panel survey collecting data on our variables of interest. Let $δ_{i} = I (i \in B)$ be an indicator variable for unit $i$ being included in the sample $B$ . The non-probability sample size is $N_{B} = \sum_{i = 1}^{N} δ_{i}$ , $Y_{B} = \sum_{i = 1}^{N} δ_{i} y_{i}$ is the sum of the $y$ values in the non-probability dataset, and $X_{B} = \sum_{i = 1}^{N} δ_{i} x_{i}$ is the sum of the $x$ variables in the non-probability dataset. In contrast to $π_{i}^{A}$ in the reference probability sample, the inclusion probabilities $π_{i}^{B} = P (δ_{i} = 1 | U)$ are unknown and need to be estimated. Further define $C = U ∖ B$ , the units in the population not included in $B$ . There may or may not be overlap between the two samples $A$ and $B$ . Figure 1 depicts these domains within $U$ .

Figure 1. Domains within the population $U$.

Some assumptions are often made regarding the selection mechanism into $B$ , in order to facilitate inferences using those datasets. Three assumptions are generally adopted by the methods in this paper; see, for example Y. Chen et al. (2020) and Yang et al. (2021). These assumptions include:

A1 Ignorability: Conditional on the set of covariates $x_{i}$ , $δ_{i}$ and $y_{i}$ are independent.

A2 Positivity or Common Support: Conditional on $x_{i}$ , $P (δ_{i} = 1 | x_{i}) > 0, \forall i \in U$ .

A3 Independence: Conditional on $x_{i}$ and $x_{j}$ , $δ_{i} ⊥ δ_{j}, i \neq j, \forall i, j \in U$ .

Ignorability implies that $P (δ_{i} = 1 | x, y) = P (δ_{i} = 1 | x)$ . In other words, selection into $B$ is ignorable conditional on the covariates $x$ . This assumption is similar to the Missing-At-Random (MAR) scenario of Rubin (1976). Andridge et al. (2019) and Boonstra et al. (2021) refer to this type of selection process as Selection At Random (SAR). When selection into $B$ is influenced by $y$ , then we instead have Selection Not At Random (SNAR), akin to the Missing-Not-At-Random (MNAR)/Not-Missing-At-Random (NMAR) scenario. In our paper we adopt the SAR and SNAR terminology to describe the type of missingness in $B$ . The simulation study explores how estimators perform under the SAR versus SNAR situations.
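The distinction between SAR and SNAR can be illustrated with a small simulation. The sketch below is not the paper's simulation design: the population model, the coefficient 0.8, and all variable names are assumptions chosen for illustration. It draws selection indicators $δ_{i}$ once with a propensity that depends on $x$ only and once with a propensity that depends on $y$, then checks whether the part of $y$ not explained by $x$ differs among the selected units.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5000

# Hypothetical population: covariate x, noise e, right-skewed outcome y.
x = rng.normal(size=N)
e = rng.normal(size=N)
log_y = 1.0 + 0.5 * x + 0.5 * e
y = np.exp(log_y)


def expit(t):
    """Inverse logit function."""
    return 1.0 / (1.0 + np.exp(-t))


# SAR: the selection propensity depends on x only (Assumption A1 holds).
delta_sar = rng.random(N) < expit(0.8 * x)

# SNAR: the propensity depends on y itself (here via standardized log y).
z_y = (log_y - log_y.mean()) / log_y.std()
delta_snar = rng.random(N) < expit(0.8 * z_y)

# e is the part of log y not explained by x. Its mean among selected units
# stays near zero under SAR but shifts away from zero under SNAR.
resid_mean_sar = e[delta_sar].mean()
resid_mean_snar = e[delta_snar].mean()
print(f"mean residual in B, SAR:  {resid_mean_sar:+.3f}")
print(f"mean residual in B, SNAR: {resid_mean_snar:+.3f}")
```

Under SAR, adjusting for $x$ is enough to remove the selection effect; under SNAR a shift remains even after conditioning on $x$, which is why methods relying on A1 can fail in that case.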

In addition to the above, the following will be assumed in some of the methods discussed in this paper:

A4 It is possible to accurately link records in $B$ to the corresponding unit in the population frame and $A$ .

A5 There is full response in $A$ .

A6 The full set of auxiliary information $x$ is available in both $A$ and $B$ , without error.

Assumption A4 may be satisfied if there exists a common unit identifier available on $B$ and the population frame, or if it is possible to accurately link records in $B$ to the population frame and/or $A$ using a common set of linking variables. In the business population setting, businesses may have a unique business number for administrative purposes such as filing taxation returns, and this identifier can be available on both the business frame and external data sources.

Assumption A5 will tend not to reflect reality, especially in this age of declining response rates (Beaumont 2020). The full response assumption in $A$ is made to simplify the simulation and discussion of its results, but we note that in general this assumption may be relaxed by assuming response probabilities in $A$ can be estimated using $x$ or a subset thereof. The relaxation implies that any non-response in $A$ will be MAR and addressable through a non-response weight adjustment or imputation process. With sufficient resources dedicated to careful non-response follow-up, we should be able to ensure response rates are relatively high and render any remaining non-response in the probability sample as ignorable.

Assumption A6 implies that we have access to all auxiliary variables relevant to Assumptions A1, A2, and A3 on the datasets $A$ and $B$ , and these variables can be fed in without error into any models we subsequently develop. On the other hand, we allow $y$ to be collected with measurement error on the non-probability dataset. Instead of observing $y_{i}$ we observe $y_{i}^{*}$ , and the two are assumed to be related through the relationship (see Kim and Tam 2021)

$$y_{i}^{*} = β_{0} + β_{1} y_{i} + e_{i} \tag{1}$$

where $(β_{0}, β_{1})$ is unknown and $e_{i} ~ (0, σ^{2})$ . In the case that $(β_{0}, β_{1}) = (0, 1)$ , we have no measurement error in $y$ . In our simulation study, we are particularly interested in the performance of estimators with versus without measurement error.
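As a rough illustration of model (1), the following sketch simulates error-prone values $y_{i}^{*}$ and recovers $(β_{0}, β_{1})$ by ordinary least squares. It assumes a set of linked units for which both $y_{i}$ and $y_{i}^{*}$ are observed (for example, an overlap between $A$ and $B$). The parameter values and the use of plain least squares are assumptions for illustration, not necessarily the fitting procedure of Kim and Tam (2021).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed true values y_i for 2,000 linked units.
y_true = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

# Assumed measurement-error parameters, for illustration only.
beta0, beta1, sigma = 5.0, 0.9, 2.0
y_star = beta0 + beta1 * y_true + rng.normal(0.0, sigma, size=y_true.size)

# Least-squares fit of y* on (1, y) using units where both are observed.
design = np.column_stack([np.ones_like(y_true), y_true])
beta_hat, *_ = np.linalg.lstsq(design, y_star, rcond=None)

# Invert the fitted relationship to obtain corrected values.
y_corrected = (y_star - beta_hat[0]) / beta_hat[1]
print("estimated (beta0, beta1):", beta_hat)
```

Setting `beta0, beta1 = 0.0, 1.0` and `sigma = 0.0` reproduces the no-measurement-error case described above.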

3. Inference Methods for Non-Probability Datasets

In this section we provide a brief overview of the literature on methods developed for making inference using non-probability datasets. A number of very good review papers also exist in this space that provide more detailed discussions of the various methods. The interested reader is pointed to S. L. Lohr and Raghunathan (2017), Zhang (2019), Tam and Holmberg (2020), Yang and Kim (2020), Rao (2021), and Wu (2022). More recently, Salvatore (2023) analyzed a large number of documents on this topic using text mining and bibliometric techniques to identify current research trends.

The method of inference used depends on the data structures we have for the non-probability and reference samples. There are three general approaches that will be discussed in this paper: weighting-based approaches, imputation-based approaches, or a combination of the two (so-called doubly-robust methods).

The data available to us in $A$ and $B$ will influence what estimation methods will be most suitable to use. Table 1, inspired by Tam and Holmberg (2020), describes four (Type I–Type IV) potential data structures relating to the non-probability sample and an accompanying probability sample. For completeness, we also include a fifth data structure (Type 0), which reflects the more standard survey situation when data is available from a probability sample and accompanying population frame, but not the non-probability sample. We will occasionally refer to these data types in our discussion.

Table 1. Data Structures.

Data type	Source	Response variables $y$	Auxiliary variables $x$
Type 0	Frame	×	✓
Type 0	P Data $(A)$	✓	✓
Type I	NP Data $(B)$	✓	✓
Type II	NP Data $(B)$	✓	✓
Type II	P Data $(A)$	×	✓
Type III	NP Data $(B)$	✓	✓
Type III	P Data $(A)$	✓	✓
Type IV	NP Data $(B)$	×	✓
Type IV	P Data $(A)$	✓	✓

3.1. Weighting Approaches

Weighting approaches create a weight associated with each record in the reference sample or the non-probability sample, and these are used to form estimates for a target population quantity using an Inverse Probability Weighting (IPW) approach. Define $w_{i}^{A}$ as the survey weight associated with a record $i$ in $A$ . $w_{i}^{A}$ may be equal to $d_{i}^{A} = 1 / π_{i}^{A}$ defined in Section 2, or $d_{i}^{A}$ adjusted via a non-response and/or calibration process. Population totals may be estimated by $\hat{Y} = \sum_{i \in A} w_{i}^{A} y_{i}$ .

Let ${\hat{π}}_{k}^{B}$ be an estimate of the propensity of selection into $B$ for unit $k$ in $B$ . Then $w_{k}^{B} = 1 / {\hat{π}}_{k}^{B}$ is the corresponding weight associated with record $k$ , and we may form estimates of population totals using $\hat{Y} = \sum_{k \in B} w_{k}^{B} y_{k}$ . Wu (2022) notes that a better estimator to use is the Hájek estimator

$${\hat{Y}}_{\text{Hájek}} = \frac{N}{{\hat{N}}_{B}} \sum_{k \in B} w_{k}^{B} y_{k} \tag{2}$$

where ${\hat{N}}_{B} = \sum_{k \in B} w_{k}^{B}$ .
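The two weighted estimators above can be compared on a toy example. This is a minimal sketch that takes the estimated propensities ${\hat{π}}_{k}^{B}$ as given; the function names and numbers are illustrative assumptions.

```python
import numpy as np


def ipw_total(y_b, pi_hat_b):
    """Inverse probability weighted total: sum over B of y_k / pi_hat_k."""
    w = 1.0 / np.asarray(pi_hat_b, dtype=float)
    return float(np.sum(w * np.asarray(y_b, dtype=float)))


def hajek_total(y_b, pi_hat_b, N):
    """Hajek total, Equation (2): rescales the weights so they sum to N."""
    w = 1.0 / np.asarray(pi_hat_b, dtype=float)
    n_hat_b = w.sum()
    return float(N / n_hat_b * np.sum(w * np.asarray(y_b, dtype=float)))


# Toy data: three units in B, population size N = 10.
y_b = [10.0, 20.0, 30.0]
pi_hat_b = [0.5, 0.25, 0.5]
print(ipw_total(y_b, pi_hat_b))        # weights (2, 4, 2) -> 160.0
print(hajek_total(y_b, pi_hat_b, 10))  # 10 / 8 * 160 -> 200.0
```

The rescaling means that when every $y_{k}$ equals a constant $c$, the Hájek estimator returns $Nc$ exactly, whatever the estimated propensities are.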

When a Type III data structure exists, and assuming that we can determine the value of $δ_{i}$ for the units in $A$ (e.g., by linking the sample $A$ to $B$), Kim and Tam (2021) proposed calibrating the design weights in $A$ to the quantities $\sum_{i = 1}^{N} (1, δ_{i}, δ_{i} y_{i}) = (N, N_{B}, Y_{B})$. The calibration process involves finding new weights $w_{i}^{A}$ for $i \in A$ that minimize a chosen distance metric subject to the calibration constraints $\sum_{i \in A} w_{i}^{A} x_{i} = \sum_{i \in U} x_{i} = X$ (Deville and Särndal 1992), where $X$ is the vector of population totals for the set of auxiliary variables. In particular, the generalized regression estimator is a special case of the calibration estimator when the chi-square distance metric $\sum_{i \in A} {(w_{i}^{A} - d_{i}^{A})}^{2} / (2 q_{i} d_{i}^{A})$ is used, where $q_{i}$ is a tuning parameter. The set of calibration constraints can be expanded to include auxiliary variables (e.g., to address non-response) in $A$, and the method can also cater for measurement error in $y$ on either $A$ or $B$.
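Under the chi-square distance, the calibrated weights have the closed form $w_{i}^{A} = d_{i}^{A} (1 + q_{i} x_{i}^{T} λ)$ with $λ = (\sum_{i \in A} d_{i}^{A} q_{i} x_{i} x_{i}^{T})^{-1} (X - \sum_{i \in A} d_{i}^{A} x_{i})$. The sketch below applies this with calibration vector $x_{i} = (1, δ_{i}, δ_{i} y_{i})$. The function name, the simulated population, and the simple random sample are assumptions for illustration and are not taken from Kim and Tam (2021).

```python
import numpy as np


def linear_calibration(d, x, totals, q=None):
    """Chi-square distance calibration: w_i = d_i * (1 + q_i * x_i' lam)."""
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)                 # shape (n, p)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    m = x.T @ ((d * q)[:, None] * x)               # sum of d_i q_i x_i x_i'
    lam = np.linalg.solve(m, np.asarray(totals, dtype=float) - d @ x)
    return d * (1.0 + q * (x @ lam))


rng = np.random.default_rng(1)
N = 1000
y = rng.lognormal(2.0, 0.8, size=N)                # hypothetical skewed variable
delta = (rng.random(N) < 0.5).astype(float)        # membership of B
x_pop = np.column_stack([np.ones(N), delta, delta * y])
totals = x_pop.sum(axis=0)                         # (N, N_B, Y_B)

# Simple random sample A of size 100 with design weights N / n_A.
sample = rng.choice(N, size=100, replace=False)
d = np.full(100, N / 100)

w = linear_calibration(d, x_pop[sample], totals)
y_hat_total = float(w @ y[sample])
print("calibrated estimate:", y_hat_total, "true total:", y.sum())
```

The calibrated weights reproduce $(N, N_{B}, Y_{B})$ exactly on the sample. Linear calibration can produce negative weights in unfavorable samples, which is one reason other distance metrics are sometimes preferred.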

Rather than making a weight adjustment for the probability sample, we can instead produce a weighted non-probability sample. This seems attractive in particular when $N_{B}$ is large in comparison with $n_{A}$ . The main purpose of the reference sample $A$ is now to aid in estimating the propensities of selection $π_{k}^{B} = f (x_{k}, ϕ)$ for $k \in B$ , where $f$ is a chosen parametric form such as the inverse logit function, and $ϕ$ are unknown parameters requiring estimation.

Under Assumption A1 and a Type I data structure, selection bias may be reduced by applying a weight calibration process to $B$ which satisfies $\sum_{k \in B} w_{k}^{B} x_{k} = X$ (see, e.g., Haziza et al. 2010). The initial weights feeding into the calibration may be given by $d_{k}^{B} = N / N_{B}$ or $d_{k}^{B} = 1$ (Golini and Righi 2024; Rueda et al. 2023). If $X$ is not available but a Type III data structure exists, a pseudo-calibration estimator may be applied (Golini and Righi 2024; see also Righi et al. 2019) using estimated totals from the reference probability sample to feed in to the calibration equation

$$\sum_{k \in B} w_{k}^{B} x_{k} = \sum_{i \in A} d_{i}^{A} x_{i} \tag{3}$$

to produce final weights $w_{k}^{B}$ for the units in $B$ .

Direct estimation of propensity scores ${\hat{π}}_{k}^{B}$ may be accomplished in a variety of ways, and methods generally assume a Type II data structure. Y. Chen et al. (2020) derive a pseudo log-likelihood function with two terms, one involving $B$ and the other involving $A$ , and this function can be maximized through iterative methods such as the Newton-Raphson procedure. Burakauskaitė and Čiginas (2023) detail a variation of this method which is applicable when $x_{i}$ are available for $i \in U$ . Kim and Wang (2019) assume that it is possible to determine $δ_{i}$ for units in $A$ , and use this to estimate propensity scores based on the probability sample.
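For a logistic propensity model, the pseudo log-likelihood of Y. Chen et al. (2020) can be written as $\sum_{k \in B} x_{k}^{T} ϕ - \sum_{i \in A} d_{i}^{A} \log \{1 + \exp (x_{i}^{T} ϕ)\}$, and its maximizer solves $\sum_{k \in B} x_{k} = \sum_{i \in A} d_{i}^{A} π (x_{i}, ϕ) x_{i}$. The sketch below maximizes it by Newton-Raphson. The step-halving safeguard, function names, and simulated data are assumptions added for illustration rather than details from that paper.

```python
import numpy as np


def expit(t):
    """Inverse logit function."""
    return 1.0 / (1.0 + np.exp(-t))


def pseudo_loglik(phi, x_b, x_a, d_a):
    """sum_B x'phi - sum_A d * log(1 + exp(x'phi)) for a logistic model."""
    return float((x_b @ phi).sum() - d_a @ np.logaddexp(0.0, x_a @ phi))


def fit_propensity(x_b, x_a, d_a, max_iter=100, tol=1e-10):
    """Newton-Raphson maximization of the pseudo log-likelihood."""
    phi = np.zeros(x_b.shape[1])
    for _ in range(max_iter):
        pi_a = expit(x_a @ phi)
        score = x_b.sum(axis=0) - (d_a * pi_a) @ x_a
        hessian = -(x_a * (d_a * pi_a * (1.0 - pi_a))[:, None]).T @ x_a
        step = np.linalg.solve(hessian, score)
        # Step halving keeps the concave objective from decreasing.
        t, current = 1.0, pseudo_loglik(phi, x_b, x_a, d_a)
        while pseudo_loglik(phi - t * step, x_b, x_a, d_a) < current and t > 1e-8:
            t /= 2.0
        phi = phi - t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return phi


rng = np.random.default_rng(7)
N = 2000
x_pop = np.column_stack([np.ones(N), rng.normal(size=N)])
# Assumed true selection model for B (roughly half the population).
in_b = rng.random(N) < expit(x_pop @ np.array([0.0, 0.5]))
# Reference sample A: simple random sample with design weights N / n_A.
sample_a = rng.choice(N, size=200, replace=False)
d_a = np.full(200, N / 200)

phi_hat = fit_propensity(x_pop[in_b], x_pop[sample_a], d_a)
pi_hat_b = expit(x_pop[in_b] @ phi_hat)
print("estimated phi:", phi_hat)
```

The resulting `pi_hat_b` values can be inverted to give weights $w_{k}^{B} = 1 / {\hat{π}}_{k}^{B}$ for use in the weighted estimators described earlier in this section.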

Elliott and Valliant (2017) estimate ${\hat{π}}_{k}^{B}$ for $k \in B$ by first pooling together the non-probability data and reference probability data. For unit $i$ in the pooled dataset $A \cup B$, they define $Z_{i} = 1$ if unit $i$ belongs to $B$, and $Z_{i} = 0$ if unit $i$ belongs to $A$. Conditional probabilities $\hat{P} (Z_{i} = 0 | x_{i})$ and $\hat{P} (Z_{i} = 1 | x_{i})$ are estimated through a modeling process, and then combined to produce an estimated propensity of selection for units in $B$. This approach does not require knowledge about $δ_{i}$ for units in $A$. On the other hand, the method assumes small sampling fractions and no overlap in the units captured within the two samples, an assumption that is not likely to hold if $B$ is large. Liu et al. (2023) addressed the overlap issue by using only the non-overlapping units in the pooled sample to fit a model to estimate certain probabilities to feed into the creation of pseudo-weights. Their method requires knowing which units comprise $A \cap B$ (so that these can be removed from the pooled sample), and is not appropriate when the non-overlapping portion of the two samples is very small, or if one sample is a subset of the other.

Wang et al. (2021) proposed an Adjusted Logistic Propensity (ALP) weighting method which also pools the non-probability and probability datasets together, but does not require the assumption of non-overlapping samples. However, in their method the final estimated propensity may sometimes be greater than 1 if $B$ is large in size. Savitsky et al. (2023) noted that the methods of Wang et al. (2021) and Y. Chen et al. (2020) are sub-optimal as they rely on pseudo-likelihoods for estimating propensities. Instead, Savitsky et al. (2023) constructed a likelihood defined directly on the observed pooled sample to estimate propensities. Their method does not require the samples $A$ and $B$ to be disjoint, allowing an unknown amount of overlap between the samples to be present. A hierarchical Bayesian approach was utilized to enable computation of all required probabilities simultaneously. In a simulation setting under moderate sample sizes, the method was found to produce more accurate estimates of inclusion probabilities for $B$ compared to the pseudo-likelihood approaches used in Wang et al. (2021) and Y. Chen et al. (2020).

If Assumption A1 does not hold and the missingness in $B$ is SNAR, then $π_{k}^{B} = f (x_{k}, y_{k}, ϕ)$ for $k \in B$ and we need to include the variable of interest $y$ in the model to estimate ${\hat{π}}_{k}^{B}$ . Marella (2023) and Kim and Morikawa (2023) employed the sample empirical likelihood to estimate ${\hat{π}}_{k}^{B}$ under a SNAR selection mechanism for $B$ . If known population means of auxiliary variables are available under a Type I data structure, they can be included as calibration constraints in the maximization process for the empirical likelihood to help address selection bias. Marella (2023) noted that under a Type II data structure, sample estimates for these auxiliary variables may be used as the calibration constraints.

Machine learning techniques have also been explored to estimate the propensity scores. Ferri-García and Rueda (2020) provide a comparison of different machine learning approaches using a pooled dataset $A \cup B$ which assumes no overlapping sample, and under a Type II data structure. Castro-Martín et al. (2022) proposed including the weights derived from a propensity score estimation process, ${\hat{w}}_{k}^{B} = 1 / {\hat{π}}_{k}^{B}, k \in B$ , as part of a subsequent machine learning model training process (e.g., linear regression) to predict ${\hat{y}}_{i}, i \in A$ under a Type II data structure. The imputed values obtained from the trained model may then be used to form estimates based on $A$ :

$${\hat{Y}}_{MI1} = \sum_{i \in A} w_{i}^{A} {\hat{y}}_{i} \tag{4}$$

More accurate estimated propensity scores may be achieved by incorporating known information about the auxiliary variables in the propensity score estimation for $B$ . Zhu et al. (2023) assume a latent Gaussian copula model for the joint distribution of the auxiliary variables. The model is fitted using data in $A$ , and a pseudo-population is simulated using the fitted model. The pseudo-population is used to estimate the marginal inclusion probabilities $P (δ_{k} = 1), k \in B$ , and these are then used to estimate propensity scores ${\hat{π}}_{k}^{B}$ .

3.2. Imputation-Based Approaches

Imputation-based approaches assume that we can form a reliable estimate for $y_{i}$ using the available auxiliary information $x_{i}$ . This approach will generally be used when $y$ is missing from $A$ (Type II data structure) or $B$ (Type IV data structure). When a Type II data structure exists, the model for $y$ is formed using the data in the non-probability sample and then applied to impute $\hat{y}$ for all the units in the reference sample, referred to as mass imputation (Chipperfield et al. 2012).

Under a Type II structure, an alternative imputation estimator to Equation (4) is (see Wu 2022)

$${\hat{Y}}_{MI2} = \sum_{k \in B} y_{k} + \left(\sum_{i \in A} w_{i}^{A} {\hat{y}}_{i} - \sum_{k \in B} {\hat{y}}_{k}\right) \tag{5}$$

where $w_{i}^{A}$ may be design, non-response adjusted, or calibrated weights for $i \in A$ , and ${\hat{y}}_{i}$ are predictions from a model, which may range from a linear model, a semi-parametric model such as a generalized additive model or a kernel regression (S. Chen et al. 2022), to non-parametric methods such as regression trees and random forests (Golini and Righi 2024). The estimator Equation (5) may be interpreted as the sum of the true values from $B$ and an estimated contribution for $C = U ∖ B$ based on the modeled $\hat{y}$ values.
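A minimal sketch of the estimator in Equation (5) under a Type II structure follows, using a linear working model fitted on $B$. The simulated population, the selection mechanism, and the function name are assumptions for illustration only.

```python
import numpy as np


def mass_imputation_total(y_b, yhat_b, yhat_a, w_a):
    """Equation (5): observed total in B plus a model-based correction."""
    return float(np.sum(y_b) + (w_a @ yhat_a - np.sum(yhat_b)))


rng = np.random.default_rng(3)
N = 2000
x = rng.uniform(1.0, 10.0, size=N)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=N)   # assumed linear model

in_b = rng.random(N) < 0.5                          # selection unrelated to y
sample_a = rng.choice(N, size=200, replace=False)   # reference sample; y not used
w_a = np.full(200, N / 200)

# Fit the working model on B, then predict for units in A and in B.
design_b = np.column_stack([np.ones(in_b.sum()), x[in_b]])
coef, *_ = np.linalg.lstsq(design_b, y[in_b], rcond=None)
yhat_a = coef[0] + coef[1] * x[sample_a]
yhat_b = coef[0] + coef[1] * x[in_b]

total_mi2 = mass_imputation_total(y[in_b], yhat_b, yhat_a, w_a)
total_mi1 = float(w_a @ yhat_a)   # plain mass-imputed sum over A
print(total_mi1, total_mi2, y.sum())
```

With an ordinary least squares fit that includes an intercept, the residuals sum to zero over $B$, so the two totals coincide here. They generally differ for other prediction methods, such as nearest neighbor imputation or tree-based models.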

Rivers (2007) proposed a sample matching approach using nearest neighbor imputation to mass impute the missing $y_{i}$ values for $i \in A$ under a Type II data structure. The non-probability dataset is treated as the donor population. A distance measure indicating the similarity of units in $A$ and $B$ is calculated using the available covariates $x$, and the closest match is chosen to supply ${\hat{y}}_{i}$. Yang et al. (2021) extended the nearest neighbor imputation method to $k$ nearest neighbors, again assuming a Type II data structure, where data from the $k$ nearest neighbors are combined to produce a mean value ${\hat{y}}_{i} = k^{-1} \sum_{j = 1}^{k} y_{i(j)}$, with $i(j)$ denoting the $j$-th nearest neighbor of unit $i$ in $B$, which is used as the imputed value. When $δ_{i}$ can be identified for each unit in $A$, the authors showed that the efficiency of the nearest neighbor imputation can be improved by also calibrating the weights in the probability sample to match known quantities in $B$, for example:

$$\sum_{i \in A} w_{i}^{A} (δ_{i}, 1 - δ_{i}, δ_{i} x_{i}, δ_{i} {\hat{y}}_{i}) = (N_{B}, N - N_{B}, X_{B}, Y_{B})$$

where $w_{i}^{A}$ are the final calibrated weights and $N$ , $N_{B}$ , $X_{B}$ , and $Y_{B}$ are defined in Section 2.
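The nearest neighbor imputation step can be sketched as follows, using Euclidean distance on the covariates and a brute-force search. The choice of distance, the function name, and the toy data are assumptions; the calibration step described above is not included.

```python
import numpy as np


def knn_impute(x_a, x_b, y_b, k=1):
    """Impute y for each unit in A as the mean y of its k nearest donors in B."""
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    y_b = np.asarray(y_b, dtype=float)
    # Pairwise Euclidean distances, shape (n_A, N_B).
    dist = np.linalg.norm(x_a[:, None, :] - x_b[None, :, :], axis=2)
    # Indices of the k smallest distances per row (order within the k is arbitrary).
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    return y_b[nearest].mean(axis=1)


# Toy donor set B (one covariate) and two recipients in A.
x_b = np.array([[0.0], [1.0], [2.0], [10.0]])
y_b = np.array([1.0, 2.0, 3.0, 100.0])
x_a = np.array([[0.1], [9.0]])
w_a = np.array([5.0, 5.0])

yhat_1nn = knn_impute(x_a, x_b, y_b, k=1)   # donors: x=0 and x=10
yhat_2nn = knn_impute(x_a, x_b, y_b, k=2)   # donors: {0, 1} and {10, 2}
total_2nn = float(w_a @ yhat_2nn)
print(yhat_1nn, yhat_2nn, total_2nn)
```

The brute-force distance matrix needs memory proportional to $n_{A} \times N_{B}$, so a tree-based or approximate search would be a more practical choice for a large $B$.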

When a Type IV data structure exists, the reference sample may be used to estimate the parameters of the model for $y_{i} | x_{i}$ . The fitted model is then used to provide predictions for $y_{k}, k \in B$ . In Righi et al. (2019) and Golini and Righi (2024), a combined imputation and weighting approach is proposed whereby predicted values ${\hat{y}}_{k}$ for the variable of interest are produced from a modeling process, and are used in a pseudo-calibration estimator ${\hat{Y}}_{PC} = \sum_{k \in B} w_{k}^{B} {\hat{y}}_{k}$ , where $w_{k}^{B}$ is the solution to the calibration equation (3). The authors note that their approach can also be used in a Type III data structure when $y$ is observed with error in $B$ (i.e., $B$ collects $y_{k}^{*}$ instead of $y_{k}$ ) and a measurement error model such as (1) is fitted to correct the error.

In the case where the modeling process fails to capture the true relationship between predictors and the variable of interest, ${\hat{Y}}_{PC}$ will be biased. Righi et al. (2019) and Golini and Righi (2024) proposed amending ${\hat{Y}}_{PC}$ by including a bias correction term in the estimator. This leads to the difference estimator (see also Breidt and Opsomer 2017).

$${\hat{Y}}_{DPC} = \sum_{k \in B} w_{k}^{B} {\hat{y}}_{k} + \sum_{i \in A} d_{i}^{A} (y_{i} - {\hat{y}}_{i}) \tag{6}$$

Medous et al. (2023) extended the calibration approach of Kim and Tam (2021) from a Type III to a Type IV structure, proposing so-called QR predictors (see Wright 1983) to produce an estimated ${\hat{Y}}_{B}$ . ${\hat{Y}}_{B}$ may then be combined with an estimated contribution from $C$ (via $A$ ) to produce an improved population estimate.

3.3. Doubly Robust Estimation

Many of the approaches outlined in the previous sections depend on an accurate working model. To protect against model misspecification, some authors have suggested doubly robust estimators constructed using both a propensity score model for $B$ and a model for $y | x$. Only one of the two models needs to be correctly specified for the resulting estimator to be approximately unbiased.

Assuming a Type II data structure, Y. Chen et al. (2020) proposed two doubly robust estimators for $\bar{Y}$ , with the second estimator preferred:

$${\hat{\bar{Y}}}_{DR1} = \frac{1}{N} \sum_{k \in B} \frac{y_{k} - {\hat{y}}_{k}}{{\hat{π}}_{k}^{B}} + \frac{1}{N} \sum_{i \in A} \frac{{\hat{y}}_{i}}{π_{i}^{A}} \tag{7}$$

and

$${\hat{\bar{Y}}}_{DR2} = \frac{1}{{\hat{N}}_{B}} \sum_{k \in B} \frac{y_{k} - {\hat{y}}_{k}}{{\hat{π}}_{k}^{B}} + \frac{1}{{\hat{N}}_{A}} \sum_{i \in A} \frac{{\hat{y}}_{i}}{π_{i}^{A}} \tag{8}$$

where ${\hat{y}}_{i}$ and ${\hat{y}}_{k}$ are the estimated values of $y_{i}, i \in A$ and $y_{k}, k \in B$ based on a fitted imputation model using $(y_{k}, x_{k})$ in $B$, ${\hat{π}}_{k}^{B}$ is the estimated probability of inclusion for $k \in B$ using a method from Section 3.1, ${\hat{N}}_{A} = \sum_{i \in A} w_{i}^{A}$ and ${\hat{N}}_{B} = \sum_{k \in B} 1 / {\hat{π}}_{k}^{B}$.
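Equations (7) and (8) can be computed directly once predictions and propensities are in hand. The sketch below takes these as inputs and, as an assumption, uses the design weights $1 / π_{i}^{A}$ as $w_{i}^{A}$ when forming ${\hat{N}}_{A}$. The function names and toy numbers are illustrative.

```python
import numpy as np


def dr1_mean(y_b, yhat_b, pi_hat_b, yhat_a, pi_a, N):
    """Equation (7): both terms are divided by the known population size N."""
    w_b = 1.0 / np.asarray(pi_hat_b, dtype=float)
    d_a = 1.0 / np.asarray(pi_a, dtype=float)
    resid = np.asarray(y_b, dtype=float) - np.asarray(yhat_b, dtype=float)
    return float((w_b @ resid + d_a @ np.asarray(yhat_a, dtype=float)) / N)


def dr2_mean(y_b, yhat_b, pi_hat_b, yhat_a, pi_a):
    """Equation (8): each term is divided by its own estimated population size."""
    w_b = 1.0 / np.asarray(pi_hat_b, dtype=float)
    d_a = 1.0 / np.asarray(pi_a, dtype=float)
    resid = np.asarray(y_b, dtype=float) - np.asarray(yhat_b, dtype=float)
    return float(w_b @ resid / w_b.sum()
                 + d_a @ np.asarray(yhat_a, dtype=float) / d_a.sum())


# Toy inputs: two units in B, three units in A.
y_b, yhat_b, pi_hat_b = [10.0, 20.0], [12.0, 18.0], [0.5, 0.25]
yhat_a, pi_a = [11.0, 15.0, 19.0], [0.1, 0.1, 0.2]

print(dr1_mean(y_b, yhat_b, pi_hat_b, yhat_a, pi_a, N=25))  # (4 + 355) / 25
print(dr2_mean(y_b, yhat_b, pi_hat_b, yhat_a, pi_a))        # 4 / 6 + 355 / 25
```

When the imputation model predicts perfectly on $B$, the first term vanishes and the estimate reduces to the weighted mean of the predictions over $A$.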

The doubly robust approach can be extended to a multiply robust scenario. Instead of employing a single propensity score model to estimate $π_{k}^{B}, k \in B$ , and a single imputation model, S. Chen and Haziza (2023) suggested the use of $m$ propensity score models and $m$ imputation models. Each model may be based on different sets of explanatory variables. The results from the $m$ models are “compressed” or summarized to form an overall estimated value for ${\hat{y}}_{k}$ or ${\hat{π}}_{k}^{B}$ . The estimator is consistent as long as one of the imputation or propensity score models is correctly specified. Kim and Morikawa (2023) also suggest the use of multiple propensity score models and multiple constraints for bias calibration in their empirical likelihood approach.

4. Alternative Sampling Frameworks for the Reference Sample

In the approaches described in Section 3, the reference probability samples $A$ are assumed to be a given. In practice, the reference sample design may be approached in a number of ways based on the nature of the available non-probability data. In this section, we briefly discuss two sampling frameworks for the reference sample as alternatives to a traditional design-based sample from the full population.

4.1. Multiple Frame Approach

The original theory for multiple-frame surveys was developed by Hartley (1962, 1974), and has been built upon in recent years; see, for example, S. Lohr and Rao (2006) and S. L. Lohr (2011). In the context of non-probability data, one can consider $B$ to be a full census from an incomplete population “frame” (S. L. Lohr 2021). The dataset $B$ may or may not measure $y$ directly; when it does not, it may instead measure covariates $x$ that can be used to predict $y$ . S. L. Lohr (2021) and Medous et al. (2023) note that the data integration estimators developed in Kim and Tam (2021) can be considered within a multiple frame context.

Assume that we can identify $C = U ∖ B$ , for example through a common unit identifier (in general, the multiple frame approach assumes that we can determine which population segment(s) each unit belongs to). Then we can employ a screening dual frame sample design and select a probability sample $A$ from $C$ only. A consistent estimator of $Y$ is then simply

{\hat{Y}}_{screening} = Y_{B} + {\hat{Y}}_{C}

(9)

where ${\hat{Y}}_{C}$ may be estimated using a Horvitz-Thompson or Hájek estimator. Calibration benchmarks for the subpopulation $C$ may be used if available to improve the efficiency of ${\hat{Y}}_{C}$ . Zhang (2019) refers to Equation (9) as the split-population approach to inference, where $B$ and $C$ constitute the two “populations.” More generally, Zhang (2019) noted that one can produce a composite estimator for the population mean based on the mean of $B$ and the mean of $C$ estimated from a reference sample.
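A minimal sketch of Equation (9) follows, assuming $y$ is observed without error for every unit in $B$ and that the sample $A$ is drawn from $C$ with known inclusion probabilities. The function name `screening_total` and the optional argument `N_C` are hypothetical; supplying `N_C` switches the estimate of $Y_{C}$ from Horvitz-Thompson to Hájek.

```python
import numpy as np

def screening_total(y_B, y_A, pi_A, N_C=None):
    """Screening dual-frame estimate of the total Y = Y_B + Y_C_hat (Equation (9))."""
    y_B = np.asarray(y_B, dtype=float)
    y_A = np.asarray(y_A, dtype=float)
    w_A = 1.0 / np.asarray(pi_A, dtype=float)  # design weights for the sample from C
    Y_B = y_B.sum()                            # B is treated as a census of its own frame
    if N_C is None:
        Y_C_hat = np.sum(w_A * y_A)            # Horvitz-Thompson estimate of Y_C
    else:
        Y_C_hat = N_C * np.sum(w_A * y_A) / w_A.sum()  # Hajek estimate of Y_C
    return Y_B + Y_C_hat
```

The Hájek variant needs the size of $C$ , which is available whenever $C = U ∖ B$ can be identified on the frame.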

A number of assumptions are generally made when making inferences under the multiple-frame framework. Within a data integration context, some of these assumptions may be less likely to hold. One potential issue is that the variables captured on the non-probability source may not match exactly the variables of interest captured in the probability sample. If a “screening” dual frame design has been used, the lack of overlapping sample makes it more difficult to assess and address any measurement error in the non-probability dataset.

4.2. Cut-Off Sampling

In cut-off sampling, a certain fraction of the population is deliberately excluded from the survey frame. A probability sample is then taken from the remainder of the population. This practice tends to be used in business surveys when the variable of interest is highly skewed; Yorgason et al. (2011) and Elisson and Elvers (2001) provide some examples. Generally, the smallest businesses are placed into a single “take-none” stratum with zero probability of selection. The contribution of these businesses to the variable of interest is assumed to be negligible compared with the remaining part of the population, so there is a saving in terms of reduced respondent burden and cost without causing significant bias (Elisson and Elvers 2001). The rest of the population may be further divided into a “take-all” (completely enumerated) and a “take-some” (sampled) stratum. Denote the population in the take-none, take-some, and take-all strata as $U_{E}$ , $U_{S}$ , and $U_{CE}$ respectively, while $Y_{E}$ , $Y_{S}$ , and $Y_{CE}$ are the totals for $y$ in those strata.

The non-probability dataset would often be useful for estimating the part of the population $U_{E}$ intentionally not covered by the cut-off sample $A$ . For example, if $y$ is measured in both $A$ and $B$ , then we might estimate $Y_{E}$ as

{\hat{Y}}_{E} = \frac{N_{E}}{N_{E, B}} \sum_{i \in U_{E} \cap B} y_{i}

where $N_{E}$ is the number of units in the population in the take-none stratum and $N_{E, B}$ is the number of units in $B$ in the take-none stratum.
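The take-none estimator above simply scales the observed big-data total in the take-none stratum by the inverse of the coverage rate $N_{E, B} / N_{E}$ . A small sketch, with hypothetical function and argument names, is:

```python
import numpy as np

def take_none_total(y_B, in_take_none, N_E):
    """Estimate Y_E by scaling up the total of y over units of B in the take-none stratum."""
    y_B = np.asarray(y_B, dtype=float)
    mask = np.asarray(in_take_none, dtype=bool)  # True where a unit of B lies in U_E
    N_EB = int(mask.sum())                       # number of B units in the take-none stratum
    if N_EB == 0:
        raise ValueError("B contains no units in the take-none stratum")
    return N_E / N_EB * y_B[mask].sum()
```

For instance, with three take-none units in $B$ reporting 1, 2, and 3, and $N_{E} = 6$ , the estimate is $(6 / 3) \times 6 = 12$ .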

Note that this estimator will only be approximately unbiased if $\frac{Y_{E}}{Y_{CE} + Y_{S}} \approx \frac{Y_{B, E}}{Y_{B, CE} + Y_{B, S}}$ . This approximation may be very imperfect in practice, but it may still be adequate when $Y_{E}$ is small, as is usually the case by design. If we have available auxiliary variables $x$ for the whole population, then we may use these to model the propensities $π_{k}^{B}$ , and thus form an IPW estimate for the take-none stratum by applying Equation (2) to $B \cap U_{E}$ . The audit sampling ideas of Zhang (2019, 2021, 2023) may be useful here to inform appropriate action for the cut-off population, with the idea that the audit sample only has to be taken intermittently.

5. Empirical Comparison of Estimation Approaches

A simulation study was conducted to compare different estimation approaches for non-probability samples within a business survey context. We selected a cross-section of approaches that would be straightforward to implement for a statistical agency. The aim of the exercise was to examine their performance under four scenarios: SAR versus SNAR missingness in the big dataset, and with versus without measurement error in $y$ on the big dataset. In our simulation the non-probability dataset includes a large fraction (about 50%) of the population, so may be considered a “big dataset.”

In addition to Assumptions A4 to A6 in Section 2, the simulation study assumes that the population frame includes some auxiliary information, including frame employment and industry class, and these variables are available to use during the estimation process. We can reasonably expect this to hold in a business survey context. For example, some business information is often available from administrative sources (like business tax data) to attach to the population frame.

The simulation consisted of the following steps: (1) generating a number of data items for a finite population with distributional properties similar to some items on a real business survey dataset, (2) drawing a random subsample from the population to be the “big dataset,” (3) drawing a reference probability sample, and (4) applying some of the methods described in Section 3 to produce estimates.

For each of the sample designs considered, $R = 2{,}000$ repeated samples $A$ and $B$ were drawn and combined in some way to produce estimates. The Monte Carlo Relative Bias (RB) and Relative Root Mean Squared Error (RRMSE) of the estimators were then calculated as

\begin{matrix} RB = \frac{1}{R} \sum_{r = 1}^{R} \frac{{\hat{Y}}_{r} - Y}{Y} \\ RRMSE = \frac{\sqrt{\frac{1}{R} \sum_{r = 1}^{R} {({\hat{Y}}_{r} - Y)}^{2}}}{Y} \end{matrix}

where ${\hat{Y}}_{r}$ represents the estimate of $Y$ from the $r$th repeated sample, and $Y$ is the true population total for $y$ .
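The two Monte Carlo summaries can be computed directly from the vector of replicate estimates. The sketch below uses a hypothetical function name `rb_rrmse` and assumes the true total $Y$ is known and non-zero, as it is in a simulation.

```python
import numpy as np

def rb_rrmse(estimates, Y):
    """Monte Carlo relative bias and relative root mean squared error."""
    est = np.asarray(estimates, dtype=float)       # one estimate per repeated sample
    rb = np.mean((est - Y) / Y)                    # average relative error
    rrmse = np.sqrt(np.mean((est - Y) ** 2)) / Y   # root mean squared error relative to Y
    return rb, rrmse
```

Two replicates of 90 and 110 around a true total of 100 give an RB of 0 and an RRMSE of 0.1, which shows how an estimator can be unbiased on average yet still have a sizeable RRMSE.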

5.1. Generating the Population and Big Dataset

The simulated population has $N = 900{,}000$ business records, consisting of two categorical domain variables (State and Industry), a frame size variable (Frame Employment), and three survey variables (Reported Employment, Total Weekly Wages/Salaries, and Overtime Pay). It resembles the real-world population of employing businesses in Australia in terms of the distribution of businesses across size categories, industry divisions, and state. A combination of published survey outputs alongside employee tax data and survey microdata sourced from the ABS DataLab environment was used to help generate the survey variables of interest. The Supplemental Data provides a detailed description of the process used to create the population and the data items with and without measurement error. The synthetic population is available at this link: https://zenodo.org/records/11095755.

The selection mechanism for inclusion in the big dataset is given by $δ_{k} \sim \text{Bernoulli}(π_{k}^{B})$ . A two-stage process was used to generate the final values of $π_{k}^{B}$ for the population. At the first stage, an initial probability $π_{1 k}^{B}$ was produced, where

π_{1 k}^{B} = \frac{\exp (ϕ_{0} + ϕ_{1} x_{k} + ϕ_{2} y_{k})}{1 + \exp (ϕ_{0} + ϕ_{1} x_{k} + ϕ_{2} y_{k})}

Two types of big dataset were produced, one following a SAR process, and one following a SNAR process. The parameters were set to $(ϕ_{0}, ϕ_{1}, ϕ_{2}) = (0.09, 0.009, 0)$ for the SAR dataset, and $(ϕ_{0}, ϕ_{1}, ϕ_{2}) = (0.85, 0.009, - 0.1)$ for the SNAR dataset. In the calculation of $π_{1 k}^{B}$ for both datasets, the $x$ variable used was Frame Employment, and the $y$ variable used was the natural logarithm of total weekly earnings.

At the second stage, the $π_{1 k}^{B}$ values were adjusted downwards by a pre-specified factor in some industries, to simulate a reduced likelihood of being present on the big dataset for those industries. The resulting probabilities were our final $π_{k}^{B}$ values. In our non-response models, units with smaller frame employment have a lower chance of being included in the big dataset $B$ .

For each sample draw of the simulation, a big dataset sample was drawn using Poisson sampling and the final $π_{k}^{B}$ probabilities.
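The two-stage selection mechanism described above can be sketched as follows. The $ϕ$ values are those given in the text; everything else (function names, the four example businesses, and the industry adjustment factors, whose actual values are not stated here) is hypothetical and for illustration only.

```python
import numpy as np

def initial_propensity(x, y, phi0, phi1, phi2):
    """First-stage logistic propensity pi_1k^B for frame employment x and log earnings y."""
    eta = phi0 + phi1 * np.asarray(x, dtype=float) + phi2 * np.asarray(y, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))  # algebraically equal to exp(eta) / (1 + exp(eta))

def poisson_sample(pi, rng):
    """Independent Bernoulli(pi_k) inclusion indicators delta_k (Poisson sampling)."""
    pi = np.asarray(pi, dtype=float)
    return rng.random(pi.shape) < pi

SAR = (0.09, 0.009, 0.0)     # selection depends on x only
SNAR = (0.85, 0.009, -0.1)   # selection also depends on y

rng = np.random.default_rng(2024)
frame_emp = np.array([1.0, 10.0, 50.0, 400.0])                     # hypothetical businesses
log_earn = np.log(np.array([900.0, 12000.0, 70000.0, 650000.0]))   # hypothetical earnings
pi1 = initial_propensity(frame_emp, log_earn, *SNAR)
industry_factor = np.array([1.0, 0.8, 1.0, 0.8])  # hypothetical second-stage downward adjustment
pi_final = pi1 * industry_factor
delta = poisson_sample(pi_final, rng)             # True where a unit lands in the big dataset
```

Because each unit is selected independently, the size of the big dataset varies from draw to draw, which is consistent with the dual-frame sample sizes being reported as averages over simulations.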

5.2. Probability Sample Designs

The population was assigned to strata based on State, Industry Division, and Frame Employment for each business. Size stratum categories were: 0 to 4 employees, 5 to 19 employees, 20 to 299, and 300+.
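The size classes listed above can be assigned with a simple binning step. The following sketch uses hypothetical names and covers only the size dimension of the stratification, not State or Industry Division.

```python
import numpy as np

SIZE_LABELS = ["0-4", "5-19", "20-299", "300+"]
SIZE_EDGES = [5, 20, 300]  # lower bounds of the second, third, and fourth classes

def size_class(frame_employment):
    """Return the index (0 to 3) of the size stratum for each business."""
    return np.digitize(np.asarray(frame_employment), SIZE_EDGES)

classes = size_class([0, 4, 5, 19, 20, 299, 300, 1000])
labels = [SIZE_LABELS[c] for c in classes]
```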

Three sample design scenarios were examined:

Single-frame—An optimal allocation of sample to strata using the full population frame, $U$

Dual-frame—An optimal allocation of sample to strata using $C = U ∖ B$ as the probability sampling frame (with $B$ comprising the other frame)

Cut-off—An optimal allocation of sample to strata where the sampling frame is the population excluding units in the smallest (0–4 employees) size class

The Bethel-Chromy algorithm (Bethel 1989; Chromy 1987) was applied to produce optimal allocations for each of the three scenarios, treating the relevant sample frame as the population of interest. For example, in the dual-frame scenario the algorithm was applied to achieve accuracy targets based on the units in $C$ . The sample designs were produced to meet the following accuracy constraints for the total earnings data item on the relevant sample frame:

Relative Standard Error (RSE) of 1.5% at the National level

RSE of 5% for each Industry Division

RSE of 5% for each State

A minimum sample size of 6 was applied for each sampled stratum. The 300+ size strata were designated to be completely enumerated strata with a sampling fraction of 1.

5.3. Estimators Examined

Table 2 provides a description of the estimators compared for the single-frame sample design.

Table 2.

Summary of Estimators Compared—Single Frame Sample.

Estimator	Data scenario	Description
GREG	Type 0	Generalized Regression estimator with frame employment as the $x$ variable
RDI	Type III	The Regression Data Integration estimator of Kim and Tam (2021), where the probability sample is calibrated to big data totals $(N, N_{B}, Y_{earnings})$ . In the with-measurement error scenario, the probability sample is calibrated to the measurement error versions of the data item
QR MA	Type IV	The model-assisted QR estimator described in Medous et al. (2023). Frame employment is used as the explanatory variable
KW	Type II	Estimation of ${\hat{π}}_{k}^{B}, k \in B$ by employing the method of Kim and Wang (2019). $δ_{i}, i \in A$ is obtained by linking units in $A$ and $B$ . Frame employment and industry division are the explanatory variables in the model for ${\hat{π}}_{k}^{B}$ . Estimates are produced using (2). In the with-measurement error scenario, the measurement error versions of the data items are used in estimation
KW-Cal	Type II	A two-step weighting process starting with creation of KW weights (see above). These weights are then calibrated to population size $N$ and frame employment total. The with-measurement error scenario uses the measurement error versions of the data items on the big dataset
KW-Earn	Type II	The KW estimator with the addition of the natural logarithm of Earnings as an explanatory variable in the model for ${\hat{π}}_{k}^{B}$
ALP	Type II	The Adjusted Logistic Propensity method of Wang et al. (2021), where estimation of ${\hat{π}}_{k}^{B}, k \in B$ involves pooling data in $A$ and $B$ together and fitting a weighted logistic model on the pooled data
Wgt_Reg_MI	Type II	A mass imputation for $y$ in $A$ using weighted regression modeling. We assume in this case that $y$ variables are not available in $A$ . We fit weighted models for each $y$ data item based on the big data, where the weights are the estimated KW weights. The models are applied to the units in the probability sample to produce ${\hat{y}}_{i}, i \in A$ . (4) is used to produce estimates of total
DR_wgt	Type II	The doubly-robust estimator (8) which combines the KW and Wgt_Reg_MI approaches
HD_MI	Type II	Hot deck mass imputation method to impute $y_{i}, i \in A$ , where industry and size groups are used to form the classes that the hot deck imputation will be performed within
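The calibration step in KW-Cal adjusts a set of starting weights so that they reproduce known totals such as $N$ and the frame employment total. The distance function used for calibration is not stated here, so the sketch below assumes linear (chi-square distance) calibration purely for illustration; the function name `linear_calibrate` is hypothetical.

```python
import numpy as np

def linear_calibrate(w, Z, totals):
    """Linear calibration: adjust weights w so that Z.T @ w_cal equals the known totals.

    w      : starting weights (for example, inverse estimated propensities)
    Z      : calibration variables, one row per unit (a column of ones calibrates to N)
    totals : known population totals of the columns of Z
    """
    w = np.asarray(w, dtype=float)
    Z = np.asarray(Z, dtype=float)
    totals = np.asarray(totals, dtype=float)
    M = (Z * w[:, None]).T @ Z                    # weighted cross-product matrix
    lam = np.linalg.solve(M, totals - Z.T @ w)    # Lagrange multipliers
    return w * (1.0 + Z @ lam)

w0 = np.array([2.0, 2.0, 2.0])
Z = np.column_stack([np.ones(3), [1.0, 2.0, 3.0]])  # intercept and frame employment
w_cal = linear_calibrate(w0, Z, totals=[10.0, 25.0])
```

Linear calibration can produce negative weights when the starting weights are far from the benchmarks; bounded alternatives such as raking or truncated methods avoid this at the cost of an iterative solution.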

The dual-frame probability sample design is a screening dual-frame design. This allows us to combine the estimate from the probability sample with the big data total to form an estimate for $U$ . Two variants were considered as described in Table 3.

Table 3.

Summary of Estimators Compared—Dual Frame Sample.

Estimator	Data scenario	Description
SP	Type III	The split-population estimator (9) for $Y$ where ${\hat{Y}}_{C}$ is a Horvitz-Thompson estimator using data from $A$
SP_Cal	Type III	The split-population estimator (9) for $Y$ , where ${\hat{Y}}_{C}$ is calibrated to population totals $(N_{C}, X_{C})$ , with $X$ being Frame Employment

Table 4 describes the estimation methods applied to the cut-off probability sample.

Table 4.

Summary of Estimators Compared—Cut-Off Sample.

Estimator	Data scenario	Description
CO+BD	Type III	A Horvitz-Thompson estimate based on the cut-off sample, added to the big data total for the small size units
CO_Cal+KWFr	Type 0	An estimate from the cut-off sample calibrated to population totals that exclude the cut-off population, combined with an estimate based on the big data in the excluded part of the population which is weighted up by KW weights produced by linking to the frame to obtain $δ_{k}, k \in B$ and frame information on $X$

For comparison, we also produced estimates where population frame information on employment and industry were the auxiliary variables available in the estimation process. The estimators used are described in Table 5.

Table 5.

Summary of Estimators Compared—Big Data and Frame Only.

Estimator	Data scenario	Description
AuxDiv	Type I	The big data total in each Industry Division $d$ adjusted by an industry-specific factor $X_{d} / X_{B, d}$ , and then summed over the $D$ industries to produce the overall total: $\sum_{d \in D} (X_{d} / X_{B, d}) \sum_{k \in N_{B, d}} y_{k}$
KWFr	Type I	Estimation of ${\hat{π}}_{k}^{B}, k \in B$ by first linking the frame data to the big dataset to find $δ_{k}, k \in B$ . A logistic regression is then fitted on the frame data to estimate ${\hat{π}}_{k}^{B}$ , with frame employment and industry division as the explanatory variables in the model. Estimates are produced using (2)
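The AuxDiv estimator in Table 5 is a ratio adjustment applied within each Industry Division. A minimal sketch, with hypothetical function and argument names, is given below; it assumes every division in the frame totals has at least one unit in the big dataset with non-zero frame employment.

```python
import numpy as np

def aux_div_total(y_B, x_B, div_B, X_by_div):
    """Sum over divisions d of (X_d / X_{B,d}) times the big-data total of y in d."""
    y_B = np.asarray(y_B, dtype=float)
    x_B = np.asarray(x_B, dtype=float)
    div_B = np.asarray(div_B)
    total = 0.0
    for d, X_d in X_by_div.items():
        in_d = div_B == d                # units of B in division d
        X_Bd = x_B[in_d].sum()           # frame employment covered by B in division d
        total += X_d / X_Bd * y_B[in_d].sum()
    return total
```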

5.3.1. Measurement Error Correction in the Non-Probability Sample

By adopting the measurement error model Equation (1), we can re-arrange and obtain an expression for a measurement-error corrected version of $y_{i}^{*}$ :

{\hat{y}}_{i} = {\hat{β}}_{1}^{- 1} (y_{i}^{*} - {\hat{β}}_{0})

(10)

In a Type III data scenario, the parameters $β_{0}$ and $β_{1}$ may be estimated using data from $A \cap B$ . In our simulation study we examined the performance of a few estimators when measurement error was present, as detailed in Table 6.

Table 6.

Summary of Estimators—Measurement Error Corrected Estimators.

Estimator	Data scenario	Description
KW-Cor	Type III	Apply the correction (10) based on data from $A \cap B$ . Apply the model to the data in the big dataset to create ${\hat{y}}_{k}, k \in B$ . Estimates are produced by feeding these ${\hat{y}}_{k}$ into (2), and applying the weights from the KW estimator
KW-Cal-Cor	Type III	Similar to KW-Cor, except with an additional calibration step to population size $N$ and frame employment total
CO_Cal+KWFr-Cor	Type III	Similar to the CO_Cal+KWFr cut-off sample approach, but with measurement error correction for the small units. The measurement error corrected data are then used to provide the contribution from the big dataset
KWFr-Cor	Type III	Similar to the KWFr estimator in Table 5, except now the probability sample is also utilized to fit a measurement error correction model for each data item. Measurement error corrected versions of each data item are used instead of $y^{*}$ to form estimates
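The correction in Equation (10) can be sketched as a two-step procedure. The sketch assumes that Equation (1) is a linear model of the form $y_{i}^{*} = β_{0} + β_{1} y_{i} + ε_{i}$ , which is what the inversion in Equation (10) implies, and that the parameters are fitted by ordinary least squares on the linked units in $A \cap B$ . The fitting method and the function names are assumptions of this illustration.

```python
import numpy as np

def fit_me_model(y_true, y_star):
    """Fit y* = b0 + b1 * y by least squares on the overlap A and B; returns (b0, b1)."""
    b1, b0 = np.polyfit(np.asarray(y_true, dtype=float),
                        np.asarray(y_star, dtype=float), deg=1)  # slope first, then intercept
    return b0, b1

def correct(y_star, b0, b1):
    """Invert the fitted model as in Equation (10): y_hat = (y* - b0) / b1."""
    return (np.asarray(y_star, dtype=float) - b0) / b1

# Hypothetical overlap data in which the big dataset under-reports y.
y_overlap = np.array([1.0, 2.0, 3.0, 4.0])
y_star_overlap = 2.0 + 0.5 * y_overlap
b0, b1 = fit_me_model(y_overlap, y_star_overlap)
y_corrected = correct([2.5, 4.0], b0, b1)
```

Because $β_{0}$ and $β_{1}$ are estimated from the comparatively small overlap, the corrected values carry additional variance, which is one reason the corrected estimators can show a higher RRMSE even when their bias is small.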

5.4. Results

Table 7 provides the reference sample sizes resulting from each of the different sample designs. Sample reductions are shown relative to the single-frame design. For this simulation, the dual-frame designs provide good potential for reducing the sample size, with a saving of about 40%. When using cut-off sampling, the resulting sample saving is about 11%. Note that we did not attempt to standardize the designs in terms of the achieved precision or RMSE of estimators, as it is unclear how such a standardization would be defined when there are many variables and estimators being considered. As a result, the sample sizes achieved by the different designs need to be considered in conjunction with the RMSE results achieved by those designs in Tables 9 to 12.

Table 7.

Sample Sizes Under Different Designs.

Sample design	Sample size	Sample reduction (%)
Single-frame	7,715	0
Dual-frame—SAR	4,559*	41
Dual-frame—SNAR	4,598*	40
Cut-off	6,883	11

*Average sample size over all simulations.
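The reduction column in Table 7 follows directly from the sample sizes, as the short check below illustrates (the dictionary keys are labels chosen for this sketch).

```python
sizes = {
    "Single-frame": 7715,
    "Dual-frame SAR": 4559,
    "Dual-frame SNAR": 4598,
    "Cut-off": 6883,
}
base = sizes["Single-frame"]
# Percentage reduction relative to the single-frame design, rounded to a whole percent.
reduction = {design: round(100 * (1 - n / base)) for design, n in sizes.items()}
```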

In each iteration of the simulation, national level estimates were produced for four $y$ variables of interest—total weekly earnings (Earn), total reported employment (Emp), total overtime (Ovt), and average weekly earnings (AWE) defined as the ratio of total weekly earnings to total reported employment. The RB and RRMSE were calculated for the different estimators described in Tables 2 to 6, and are shown in Tables 9 to 12.

Table 8 lists the best-performing estimator in terms of RRMSE for each data item in each of the four scenarios of SAR versus SNAR and with versus without measurement error. No one estimator consistently outperforms the others in all scenarios. However, we note that the SP_Cal estimator seems to perform reasonably in the no measurement error scenario, while the GREG estimator more often performs best when there is measurement error. We next highlight our results for each of the four classes of estimators considered.

Table 8.

Best-Performing Estimators Under Different Scenarios.

	No measurement error	With measurement error
Selection At Random	Earn: SP_Cal	Earn: GREG
	Emp: KW-Cal	Emp: GREG
	Ovt: KWFr	Ovt: KWFr-Cor
	AWE: AuxDiv	AWE: KWFr
Selection Not At Random	Earn: SP_Cal	Earn: GREG
	Emp: SP_Cal	Emp: GREG
	Ovt: ALP	Ovt: KW-Cal-Cor
	AWE: ALP	AWE: DR_wgt

5.4.1. Single-Frame Design Results

The GREG (Type 0 data structure), RDI (Type III), and QR MA (Type IV) estimators rely on the probability sample data for inferences, and as expected they have negligible relative bias under all missingness and measurement error scenarios we examined. These estimators are “safe” estimators, producing robust performance. Of these three estimators, the GREG with frame employment as the auxiliary variable performs best; it is also consistently among the best performers across all estimators. The RDI estimator, which relies on $B$ only to provide calibration benchmarks for weighting $A$ , achieves good performance, coming close to the GREG. It is also asymptotically unbiased. Combining the GREG and RDI benchmarks into a single calibration process can further reduce the RRMSE of the resulting estimates. However, care should be taken when choosing the set of benchmarks, as including too many calibration constraints may result in an increased RRMSE, or worse yet an infeasible calibration process (as noted in Golini and Righi 2024). The addition of measurement error in $y$ erodes the performance of the RDI estimator (see Tables 10 and 12).

When there is no measurement error, the RRMSE for the QR MA estimator is slightly higher than that for the RDI estimator, reflecting a penalty due to the fact that it estimates $y_{k}$ for all units in the large dataset $B$ , while the RDI is able to utilize the real values of $y_{k}, k \in B$ . When measurement error exists in $B$ , however, the QR MA estimator provides slightly better RRMSE outcomes for Earn and Emp compared with the RDI which uses the mis-measured $y_{i}^{*}$ from $B$ in its benchmarks.

A few variants of propensity score estimator were examined in our study, applicable under a Type II data structure. The KW-Cal estimator performs very well in the ideal scenario of SAR and no measurement error (see Table 9). In general, the calibration to frame employment helps to reduce the variance of the KW estimates.

Table 9.

Monte Carlo Bias and RRMSE of Estimators Based on 2,000 Samples—SAR, No Measurement Error.

Estimator	RB ( $\times 10^{2}$ )				RRMSE ( $\times 10^{2}$ )
Estimator	Earn	Emp	Ovt	AWE	Earn	Emp	Ovt	AWE
Single-frame design
RDI	0.0	0.0	0.2	0.0	1.0	1.0	6.3	0.4
GREG	0.0	0.0	0.2	0.0	0.8	0.7	6.3	0.4
QR MA	0.0	0.0	0.2	0.0	1.3	1.2	6.4	0.4
KW	0.1	−0.2	0.3	0.2	2.0	2.0	2.4	0.5
KW-Cal	0.2	0.0	0.4	0.2	0.5	0.0	1.5	0.5
KW-Earn	−0.1	−0.3	0.1	0.2	2.2	2.2	2.6	0.5
ALP	−0.2	0.0	0.6	−0.1	1.1	1.1	1.3	0.1
Wgt_Reg_MI	0.2	0.0	0.6	0.3	1.2	1.1	1.3	0.3
DR_wgt	0.2	0.0	0.6	0.3	1.2	1.1	1.3	0.3
HD_MI	0.2	0.2	0.5	0.0	1.4	1.4	6.5	0.5
Dual-frame design
SP	0.0	0.0	0.0	0.0	0.5	0.5	2.5	0.2
SP_Cal	0.0	0.0	0.0	0.0	0.3	0.3	2.5	0.2
Cut-off design
CO+BD	−6.2	−6.3	−6.5	0.0	6.3	6.4	8.7	0.4
CO_Cal+KWFr	0.1	0.0	0.1	0.2	0.6	0.5	5.6	0.4
Big Data only
AuxDiv	−2.9	−2.8	−3.3	−0.1	2.9	2.8	3.3	0.1
KWFr	0.4	0.1	0.6	0.3	0.4	0.1	0.7	0.3

Note. “RDI” is the calibrated estimator outlined in Kim and Tam (2021); “GREG” is the Generalized Regression estimator; “QR MA” is the estimator of Medous et al. (2023); “KW” and “KW-Earn” are IPW estimators based on Kim and Wang (2019), with frame employment and log(earnings) (KW-Earn only) in the model for ${\hat{π}}_{k}^{B}$ ; “KW-Cal” is the KW estimator calibrated to $N$ and total frame employment; “ALP” is the Wang et al. (2021) estimator; “Wgt_Reg_MI” imputes $y$ in $A$ using weighted regression models fitted on $B$ ; “DR_wgt” is a doubly robust estimator; “HD_MI” is a mass imputation for $A$ using the hot deck method; “SP” is the estimator Equation (9); “SP_Cal” is Equation (9) with the probability sample calibrated to population totals; “CO+BD” is the HT estimator for the cut-off sample added to the big data total for $U_{E}$ ; “CO_Cal+KWFr” combines a calibrated total for ${\hat{Y}}_{F}$ and an estimate of ${\hat{Y}}_{E}$ using KW; “AuxDiv” is $\sum_{d \in D} (X_{d} / X_{B, d}) \sum_{k \in N_{B, d}} y_{k}$ ; “KWFr” is a KW estimator with propensities estimated using frame data linked to the big dataset.

The ALP estimator performs favorably as an alternative to the KW estimator, achieving a lower RRMSE in the four scenarios tested. In our simulation study, we found that some of the resulting propensities were larger than 1 when using this method, and this led to IPW weights of less than 1. This, along with other conceptual issues with the pooled approach noted by Wu (2022), means the survey practitioner will need to consider how applicable this approach will be for their case. In our study, though, these issues did not hinder the effectiveness of the estimator to produce relatively efficient, unbiased estimates of population totals.

In the SAR case, the inclusion of earnings in the KW-Earn estimator did not produce reductions in RRMSE over the KW estimator, which is expected since in the SAR case missingness in $B$ does not depend on earnings (see Table 9). On the other hand, in the SNAR case where the propensity of selection into $B$ is also influenced by earnings, the inclusion of earnings in the model yielded lower RB and RRMSE (see Table 11) compared with the KW estimator.

The performance of the mass imputation approach Wgt_Reg_MI in our study was better than the KW estimator in all scenarios, although both suffered under measurement error. The inclusion of estimated propensity weights in the regression imputation model improved the performance of the regression model, aligning with findings in Castro-Martín et al. (2022). The DR_wgt estimator, which provides protection against mis-specification in one of the IPW or regression models, tends to have an RRMSE comparable to that of the Wgt_Reg_MI estimator.

The HD_MI estimator is a non-parametric mass imputation approach for $A$ . This estimator was chosen as a less computationally intensive approach (and hence faster) compared with k Nearest Neighbor imputation. Compared to the parametric regression mass imputation approach, HD_MI produced comparable results for three of the four data items of interest in all scenarios except the ideal SAR without measurement error case. The Ovt data item was the one item where the hot deck imputation did not perform as well as the regression model. The results suggest that, when suitable imputation classes are formed using the covariates $x$ , the hot-deck method may be a faster, simpler to implement alternative to nearest neighbor imputation.
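A hot deck mass imputation within classes can be sketched as below. The text specifies only that industry and size groups form the imputation classes; drawing one donor at random with replacement from $B$ for each unit in $A$ is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def hot_deck_impute(class_A, class_B, y_B, rng):
    """Impute y for each unit in A by drawing a random donor from B in the same class."""
    class_A = np.asarray(class_A)
    class_B = np.asarray(class_B)
    y_B = np.asarray(y_B, dtype=float)
    y_imp = np.empty(len(class_A), dtype=float)
    for c in np.unique(class_A):
        donors = y_B[class_B == c]               # observed y values in B for this class
        if donors.size == 0:
            raise ValueError(f"no donors in B for class {c!r}")
        recipients = class_A == c                # units in A that need a value
        y_imp[recipients] = rng.choice(donors, size=int(recipients.sum()), replace=True)
    return y_imp

rng = np.random.default_rng(0)
y_imputed = hot_deck_impute([0, 1, 1], [0, 0, 1], [5.0, 5.0, 9.0], rng)
```

The imputed values would then be combined with the design weights of $A$ to produce an estimated total. Since each recipient only needs a class lookup and a random draw, the cost is low compared with searching for nearest neighbors among the records in $B$ .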

It is worth pointing out the RRMSE results for the Overtime variable when no measurement error is present (Tables 9 and 11). The Overtime variable has a larger population variance and lower correlation with the benchmarking variable Frame Employment compared with Earnings and Reported Employment. We did not include the total value of Overtime in $B$ as a calibration constraint for the RDI and QR MA estimators, and as a result the RRMSE results for the RDI and QR MA estimators tend to be large. We could of course include the $B$ totals for all the variables of interest in the calibration, but as noted above this may lead to more variable weights and hence higher RRMSE, or an infeasible calibration process. This suggests that in a multi-purpose survey with a potentially large number of disparate variables, these estimators may not be suitable as it will be infeasible to include all of the different variables in the calibration process.

More generally for Overtime when no measurement error is present, the estimators based on the probability sample $A$ —RDI, GREG, QR MA, and HD_MI—tend to have a large RRMSE. On the other hand, the KW-based, ALP, Wgt_Reg_MI, and DR_wgt methods achieve a significantly lower RRMSE; these estimators all base their inference on the much larger dataset $B$ which helps in lowering the variance of the Overtime estimates.

From Tables 10 and 12, the KW-based, HD_MI, Wgt_Reg_MI, and DR_wgt approaches applicable under a Type II data structure all yield poor results under measurement error due to their reliance on the data in $B$ . This was true under both a SAR and SNAR setting. None of these approaches included any provision for erroneously measured data items. Measurement error correction successfully negates the bias from measurement error when the measurement error model holds. However, the cost is a much higher overall RRMSE.

Table 10.

Monte Carlo Bias and RRMSE of Estimators Based on 2,000 Samples—SAR, with Measurement Error.

Estimator	RB ( $\times 10^{2}$ )				RRMSE ( $\times 10^{2}$ )
Estimator	Earn	Emp	Ovt	AWE	Earn	Emp	Ovt	AWE
Single-frame design
RDI	0.2	0.1	0.3	0.0	1.4	1.3	6.4	0.4
GREG	0.0	0.0	0.2	0.0	0.8	0.7	6.3	0.4
QR MA	0.0	0.0	0.2	0.0	1.3	1.2	6.4	0.4
KW	−12.9	−12.8	−13.0	−0.2	13.0	12.9	13.2	0.5
KW-Cal	−12.8	−12.6	−12.9	−0.2	12.8	12.6	13.0	0.5
KW-Cor	−0.1	−0.4	−0.1	0.3	3.2	3.6	5.3	1.2
KW-Cal-Cor	0.1	−0.3	0.1	0.3	2.5	3.0	4.8	1.2
KW-Earn	−13.0	−12.9	−13.1	−0.2	13.2	13.1	13.3	0.5
ALP	−12.9	−12.4	−12.5	−0.6	13.0	12.5	12.6	0.6
Wgt_Reg_MI	−12.8	−12.6	−12.8	−0.1	12.8	12.7	12.8	0.2
DR_wgt	−12.8	−12.6	−12.8	−0.1	12.8	12.7	12.8	0.2
HD_MI	−12.8	−12.4	−12.7	−0.4	12.9	12.6	14.8	1.0
Dual-frame design
SP	−8.2	−7.9	−8.4	−0.4	8.3	7.9	8.8	0.4
SP_Cal	−8.2	−7.9	−8.4	−0.4	8.2	7.9	8.8	0.4
Cut-off design
CO+BD	−7.2	−7.1	−7.5	−0.1	7.3	7.2	9.4	0.4
CO_Cal+KWFr	−1.7	−1.7	−1.8	0.0	1.9	1.8	5.9	0.4
CO_Cal+KWFr-Cor	0.2	0.1	0.3	0.0	0.9	0.8	5.7	0.6
Big Data only
AuxDiv	−15.1	−14.7	−15.8	−0.4	15.1	14.7	15.8	0.5
KWFr	−12.7	−12.5	−12.7	−0.1	12.7	12.5	12.7	0.2
KWFr-Cor	0.2	−0.1	0.3	0.3	2.5	2.9	4.6	1.1

Note. “RDI” is the calibrated estimator outlined in Kim and Tam (2021); “GREG” is the Generalized Regression estimator; “QR MA” is the estimator of Medous et al. (2023); “KW” and “KW-Earn” are IPW estimators based on Kim and Wang (2019), with frame employment and log(earnings) (KW-Earn only) in the model for ${\hat{π}}_{k}^{B}$ ; “KW-Cal” is the KW estimator calibrated to $N$ and total frame employment; “KW-Cor” and “KW-Cal-Cor” are measurement error corrected versions of “KW” and “KW-Cal”; “ALP” is the Wang et al. (2021) estimator; “Wgt_Reg_MI” imputes $y$ in $A$ using weighted regression models fitted on $B$ ; “DR_wgt” is a doubly robust estimator; “HD_MI” is a mass imputation for $A$ using the hot deck method; “SP” is the estimator Equation (9); “SP_Cal” is Equation (9) with the probability sample calibrated to population totals; “CO+BD” is the HT estimator for the cut-off sample added to the big data total for $U_{E}$ ; “CO_Cal+KWFr” combines a calibrated total for ${\hat{Y}}_{F}$ and an estimate of ${\hat{Y}}_{E}$ using KW; “CO_Cal+KWFr-Cor” is a measurement error corrected version of “CO_Cal+KWFr”; “AuxDiv” is $\sum_{d \in D} (X_{d} / X_{B, d}) \sum_{k \in N_{B, d}} y_{k}$ ; “KWFr” is a KW estimator with propensities estimated using frame data linked to the big dataset; “KWFr-Cor” is a measurement error corrected version of “KWFr.”

5.4.2. Dual-Frame Design Results

In the without measurement error scenarios, the split-population estimators are unbiased, and their RRMSE results are generally among the lowest of all estimators we examined across SAR and SNAR settings (see Tables 9 and 11). Including a calibration to auxiliary population totals for the dual-frame estimator helps to reduce RRMSE even further for earnings and reported employment.

Table 11.

Monte Carlo Bias and RRMSE of Estimators Based on 2,000 Samples—SNAR, No Measurement Error.

Estimator	RB ( $\times 10^{2}$ )				RRMSE ( $\times 10^{2}$ )
Estimator	Earn	Emp	Ovt	AWE	Earn	Emp	Ovt	AWE
Single-frame design
RDI	0.0	0.0	−0.1	0.0	1.1	1.0	6.5	0.4
GREG	0.0	0.0	−0.1	0.0	0.8	0.7	6.4	0.4
QR MA	0.0	0.0	−0.1	0.0	1.3	1.2	6.5	0.4
KW	−3.0	−3.1	−2.9	0.1	3.6	3.6	3.9	0.5
KW-Cal	−1.5	−1.6	−1.6	0.1	1.6	1.6	2.3	0.5
KW-Earn	−0.2	−0.2	−0.1	0.0	1.9	1.8	2.5	0.5
ALP	−1.7	−1.6	−1.5	−0.1	2.1	2.0	1.9	0.1
Wgt_Reg_MI	−1.4	−1.6	−1.5	0.2	1.9	2.0	1.9	0.2
DR_wgt	−1.4	−1.6	−1.5	0.2	1.8	1.9	1.9	0.2
HD_MI	−1.3	−1.2	−1.2	−0.1	1.9	1.8	6.8	0.5
Dual-frame design
SP	0.0	0.0	0.0	0.0	0.6	0.6	2.9	0.2
SP_Cal	0.0	0.0	0.0	0.0	0.4	0.3	2.8	0.2
Cut-off design
CO+BD	−6.6	−6.6	−6.8	0.0	6.7	6.7	8.9	0.4
CO_Cal+KWFr	−0.8	−0.9	−0.8	0.1	1.0	1.0	5.7	0.4
Big Data only
AuxDiv	−3.3	−3.2	−3.9	−0.2	3.3	3.2	4.0	0.2
KWFr	−2.7	−2.8	−2.6	0.1	2.7	2.8	2.7	0.1

However, measurement error in $B$ leads to bias in the split-population estimators due to the reliance on $B$ to produce a value for $Y_{B}$ (see Tables 10 and 12). The dual-frame samples here do not have any overlap between $A$ and $B$ . Including some overlap between the samples $A$ and $B$ may be desirable to provide some data for correcting the measurement error.

Table 12.

Monte Carlo Bias and RRMSE of Estimators Based on 2,000 Samples—SNAR, with Measurement Error.

Estimator	RB ( $\times 10^{2}$ )				RRMSE ( $\times 10^{2}$ )
Estimator	Earn	Emp	Ovt	AWE	Earn	Emp	Ovt	AWE
Single-frame design
RDI	0.2	0.2	0.1	0.0	1.4	1.3	6.5	0.4
GREG	0.0	0.0	−0.1	0.0	0.8	0.7	6.4	0.4
QR MA	0.0	0.0	−0.1	0.0	1.3	1.2	6.5	0.4
KW	−15.4	−15.1	−15.6	−0.3	15.5	15.2	15.8	0.6
KW-Cal	−14.1	−13.8	−14.5	−0.3	14.1	13.8	14.6	0.5
KW-Cor	−3.0	−3.2	−3.1	0.2	4.4	4.8	6.3	1.4
KW-Cal-Cor	−1.5	−1.7	−1.8	0.2	2.9	3.5	5.4	1.4
KW-Earn	−13.1	−12.8	−13.4	−0.4	13.2	12.9	13.6	0.6
ALP	−14.2	−13.7	−14.2	−0.5	14.2	13.8	14.3	0.5
Wgt_Reg_MI	−14.0	−13.8	−14.4	−0.3	14.1	13.8	14.4	0.3
DR_wgt	−14.0	−13.8	−14.4	−0.3	14.1	13.8	14.4	0.3
HD_MI	−14.0	−13.6	−14.2	−0.5	14.2	13.8	16.0	1.0
Dual-frame design
SP	−7.6	−7.3	−7.7	−0.3	7.6	7.3	8.2	0.4
SP_Cal	−7.6	−7.3	−7.7	−0.3	7.6	7.3	8.2	0.4
Cut-off design
CO+BD	−7.5	−7.4	−7.7	−0.1	7.6	7.5	9.6	0.4
CO_Cal+KWFr	−2.5	−2.5	−2.6	0.0	2.6	2.6	6.2	0.4
CO_Cal+KWFr-Cor	−0.7	−0.8	−0.6	0.1	1.2	1.2	5.8	0.6
Big Data only
AuxDiv	−15.5	−15.0	−16.4	−0.6	15.5	15.0	16.4	0.6
KWFr	−15.1	−14.9	−15.4	−0.3	15.1	14.9	15.4	0.3
KWFr-Cor	−2.7	−2.9	−2.8	0.2	3.6	4.2	5.6	1.3

Note. “RDI” is the calibrated estimator outlined in Kim and Tam (2021); “GREG” is the Generalized Regression estimator; “QR MA” is the estimator of Medous et al. (2023); “KW” and “KW-Earn” are IPW estimators based on Kim and Wang (2019), with frame employment and log(earnings) (KW-Earn only) in the model for $\hat{\pi}_{k}^{B}$; “KW-Cal” is the KW estimator calibrated to $N$ and total frame employment; “KW-Cor” and “KW-Cal-Cor” are measurement error corrected versions of “KW” and “KW-Cal”; “ALP” is the Wang et al. (2021) estimator; “Wgt_Reg_MI” imputes $y$ in $A$ using weighted regression models fitted on $B$; “DR_wgt” is a doubly robust estimator; “HD_MI” is a mass imputation for $A$ using the hot deck method; “SP” is the estimator in Equation (9); “SP_Cal” is Equation (9) with the probability sample calibrated to population totals; “CO+BD” is the HT estimator for the cut-off sample added to the big data total for $U_{E}$; “CO_Cal+KWFr” combines a calibrated total for ${\hat{Y}}_{F}$ and an estimate of ${\hat{Y}}_{E}$ using KW; “CO_Cal+KWFr-Cor” is a measurement error corrected version of “CO_Cal+KWFr”; “AuxDiv” is $\sum_{d \in D} (X_{d} / X_{B, d}) \sum_{k \in N_{B, d}} y_{k}$; “KWFr” is a KW estimator with propensities estimated using frame data linked to the big dataset; “KWFr-Cor” is a measurement error corrected version of “KWFr.”

5.4.3. Cut-Off Design Results

The results for the estimators using the cut-off sample design show that, in our case, simply combining the reference sample with the portion of $B$ under the cut-off threshold did not account for all of the contribution below the threshold. Units under the cut-off threshold that were also not in $B$ contributed nothing to the estimate, which produced a significant negative bias. This was exacerbated by the fact that small units are less likely to be in $B$.
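The source of this negative bias can be made explicit with a short decomposition. It assumes, for illustration only, that the HT estimator ${\hat{Y}}_{F}$ is unbiased for the total of the part of the population covered by the cut-off sample, that $B$ is treated as fixed and free of measurement error, and that $y_{k} \ge 0$ for all units:

$$
\hat{Y}_{CO+BD} = \hat{Y}_{F} + \sum_{k \in B \cap U_{E}} y_{k}, \qquad E(\hat{Y}_{CO+BD}) - Y = - \sum_{k \in U_{E} \setminus B} y_{k} \le 0 .
$$

Under these assumptions the bias is exactly the total of the units that are both below the threshold and absent from $B$, so it grows as small units become less likely to appear in $B$.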

The CO_Cal+KWFr estimator aims to account explicitly for the contribution of the excluded part of the population $U_{E}$ using the KW estimator and available frame information. In the scenarios without measurement error, this effectively corrects for the selection bias and also results in a low RRMSE for all items except overtime. When measurement error exists, using the big data to account for the contribution of $U_{E}$ introduces some bias into the estimates. The CO_Cal+KWFr-Cor estimator, which applies a measurement error model using available auxiliary information (in this case, from the frame), reduces the RB and achieves an overall RRMSE that ranks well compared with the other estimators.

In the SAR scenario without measurement error (Table 9), there is little reason to opt for the CO_Cal+KWFr estimator rather than the KWFr estimator. In this case, where the data in $B$ are reliable, the reference sample is less necessary for producing efficient estimates. However, when SNAR missingness or measurement error exists, the reference sample provides a safety net: a means to help correct the measurement error and to reduce the impact of SNAR missingness in the big data.

For the worst-case scenario of SNAR with measurement error, the CO_Cal+KWFr-Cor estimator yielded close to the lowest overall RRMSE for the data items of interest (see Table 12). This was the case even though the contribution from the KW estimator did not fully reflect the non-response mechanism, since that estimator did not include earnings (hence there is still a negative bias in the estimates obtained). This demonstrates the benefit of not relying solely on $B$ to produce estimates (as the single-frame KW-based estimators do). Additionally, the exclusion of the smallest units from the cut-off sample helps to reduce the overall sample variance.

5.4.4. Big Data Only Results

When aggregate auxiliary information is available from the population, it can be used to help reduce selection bias, as evidenced by the performance of the AuxDiv estimator. The AuxDiv estimator does not remove all bias, as it does not include all of the $x$ variables related to the probability of inclusion in $B$. The presence of measurement error increases the bias of the estimate.
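The AuxDiv formula given in the table notes, $\sum_{d \in D} (X_{d} / X_{B, d}) \sum_{k \in N_{B, d}} y_{k}$, can be sketched in a few lines. The sketch reads $d$ as a domain such as an industry division and $X$ as an auxiliary total such as frame employment; both readings, the function name `auxdiv_total`, and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def auxdiv_total(y_B, x_B, div_B, X_pop_by_div):
    """Scale the y total of B in each division by X_d / X_{B,d}, then sum."""
    total = 0.0
    for d, X_d in X_pop_by_div.items():
        in_d = div_B == d
        total += X_d / x_B[in_d].sum() * y_B[in_d].sum()
    return total

# Toy big dataset B: two divisions, auxiliary x and study variable y.
div_B = np.array(["A", "A", "B", "B", "B"])
x_B = np.array([10.0, 30.0, 5.0, 5.0, 10.0])
y_B = np.array([100.0, 300.0, 40.0, 60.0, 100.0])
X_pop = {"A": 80.0, "B": 60.0}   # hypothetical population totals of x by division

est = auxdiv_total(y_B, x_B, div_B, X_pop)
print(est)  # 1400.0 = (80/40)*400 + (60/20)*200
```

Each division's total is inflated by the inverse of its coverage in terms of $x$, so selection that varies within a division, or that depends on variables other than $x$, is left uncorrected.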

The KWFr estimator performs very well in the SAR without measurement error scenario (see Table 9), achieving close to the lowest RRMSE for all data items. This estimator uses unit-level auxiliary information for the full population to estimate the propensity scores for the large dataset $B$ . Most of the contribution to RRMSE comes from bias rather than variance.
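A minimal sketch of this kind of frame-based propensity weighting is given below. It is not the Kim and Wang (2019) procedure or the study's code: it simply fits a logistic model for membership of $B$ on a frame covariate by Newton-Raphson and then inverse-weights the units in $B$. The simulated population, the SAR selection model, and all names are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, delta, n_iter=25):
    """Newton-Raphson fit of P(delta = 1 | x) = expit(x' beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (delta - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(2)
N = 20000
emp = rng.lognormal(2.0, 1.0, N)                 # frame employment, known for all units
y = 50.0 * emp * rng.lognormal(0.0, 0.2, N)      # earnings-like study variable
X = np.column_stack([np.ones(N), np.log(emp)])

# SAR selection into B: depends on the frame covariate only.
delta = rng.random(N) < expit(X @ np.array([-1.0, 0.5]))

# Membership of B is known for every frame unit once B is linked to the frame.
beta_hat = fit_logistic(X, delta.astype(float))
pi_hat = expit(X @ beta_hat)

naive_total = N * y[delta].mean()
ipw_total = np.sum(y[delta] / pi_hat[delta])
print(naive_total / y.sum(), ipw_total / y.sum())
```

Because selection here depends only on a covariate held on the frame, the weighted total recovers the population total closely, whereas the unweighted expansion overstates it. If selection also depended on $y$, the same weights would leave residual bias.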

Measurement error degrades the performance of the KWFr estimator (see Tables 10 and 12). Similarly, in the SNAR scenarios the estimator does not perform as well, since it cannot include the earnings variable in the propensity model. The results for the KWFr and AuxDiv estimators demonstrate that relying on the non-probability dataset alone for inference is likely to lead to biased estimates if there are inherent issues with either the measurement of its data items or the completeness of the auxiliary information available from the population frame.

6. Concluding Remarks

A variety of approaches, ranging from weighting-based methods, to model-based or imputation approaches, to combinations of the two, have been developed to address the faults of non-probability data. The objective of this paper is to compare how a range of these methods perform in a business survey context under different missingness and measurement error settings for a large non-probability dataset, with a reference probability sample available to assist. The results in the paper provide valuable insight into the usefulness of these methods under various conditions, and are important for increasing the efficiency of statistics produced while reducing respondent burden and sample sizes.

When auxiliary information that is related to $y$, or that effectively describes the missingness mechanism in $B$, is available and used in the estimation process, the methods we examined can effectively account for selection bias. In the most ideal scenario of SAR missingness and no measurement error in the non-probability dataset, it is not imperative to have a reference sample which overlaps with the non-probability data. When the non-probability data is big and can be linked to the population frame, the use of a calibrated split-population estimator provides a beneficial combination of a low reference sample size and the best accuracy of the estimators we considered. The KW estimator calibrated to population totals also performs very well in this scenario, assuming the working model for $\pi_{i}^{B}$ holds.

When there is SNAR missingness in the non-probability dataset but no measurement error, the calibrated split-population estimator still provides the best results. This approach is robust to the selection mechanism at play in the non-probability dataset, since we use the non-probability data as-is and supplement it with data from the reference sample to cover the contribution for the population not in the non-probability dataset. The estimator is more efficient for large non-probability datasets. In situations where the non-probability dataset is a small fraction of the population, for example a small web panel survey, the performance of the estimator will be closer to that of the GREG.

The presence of measurement error in the non-probability data source affects the performance of the estimators, such that the best estimators tended not to rely heavily on the non-probability data. In our study, the GREG, RDI, and QR MA estimators, all of which rely on the reference sample as the basis for estimation, tended to perform best in the scenarios with measurement error. An advantage of the RDI estimator over the measurement error corrected estimators listed in Table 6 is that we can continue to use the $y$ values in $B$ as-is: the form of the estimator does not need to change, and it remains unbiased.

The corrected cut-off estimator, which combines the cut-off sample contribution with an estimate for the excluded part of the population, also performs well under measurement error. Further research could be beneficial for determining appropriate cut-off thresholds for the reference sample which take advantage of the ability to model or estimate the contribution from the excluded part of the population based on the available non-probability data.

Collecting information on the data item(s) of interest in the probability sample would assist with developing a model to correct for measurement error. One factor to be wary of is that the measurement error model may introduce additional variability into the estimates, so that we may choose instead to use the probability sample as the basis for inference and not rely on the non-probability data.

Although the RDI estimator is never the best-performing estimator, it is (asymptotically) unbiased, and has a reasonably low RRMSE in all scenarios. The estimator is also robust to SNAR situations as well as the presence of measurement error, but its efficiency can suffer under those less-than-ideal scenarios. When data for $y$ is available in both $A$ and $B$ (Type III data structure) then the use of the RDI estimator, with the addition of population level benchmarks in the calibration, could be a relatively safe, low-risk approach which yields some gains while potentially not maximizing them. We note the need to ensure the total number of benchmarks applied does not become excessively large. One could also combine the RDI estimator with a split-population approach in the design of $A$ to obtain a more efficient reference sample.

Many of the methods that have been developed make two assumptions: ignorability, which implies a SAR scenario in the non-probability data source, and common support, which implies that the probability of being in the non-probability dataset $B$ is non-zero for $i \in U$ . These assumptions may not necessarily hold. In the SNAR scenario, it may not be appropriate to just include all available data items in the propensity model. Methods are being developed to address the SNAR missingness scenario—see for example Marella (2023) and Kim and Morikawa (2023). Y. Chen et al. (2023) consider approaches to estimation when the common support assumption is violated and some part of the population does not have any chance of being selected into the non-probability sample. This may occur, for instance, with some social media datasets. Further development of methods that require fewer assumptions to be effective is an area for future research.

Including all $y$ data items in the propensity score model is not necessarily helpful as a catch-all approach to include potentially helpful covariates to explain SNAR non-response. This raises the question: How can we test whether to use an approach based on SAR or one that assumes a SNAR situation, and how do we determine which $y$ data item(s) need to be included in the SNAR model? These questions will be explored further by the authors. Meng (2018, 2022) has put forward the notion of data defect correlation and suggested miniaturizing this quantity to eliminate bias. Work such as that conducted by Andridge et al. (2019) on evaluating potential selection bias due to non-ignorable selection may be another avenue to pursue.
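For readers unfamiliar with the data defect correlation, the sketch below checks numerically the identity from Meng (2018) that expresses the error of an unweighted mean as the product of the data defect correlation, a data quantity term $\sqrt{(1-f)/f}$, and the population standard deviation. The simulated population and the selection mechanism, in which inclusion depends on $y$ itself, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10000
y = rng.lognormal(3.0, 1.0, N)

# Hypothetical SNAR-style selection: inclusion probability depends on y itself.
p = 1.0 / (1.0 + np.exp(-(np.log(y) - 3.0)))
R = (rng.random(N) < p).astype(float)     # 1 if the unit is in the dataset

f = R.mean()                              # fraction of the population observed
rho = np.corrcoef(R, y)[0, 1]             # data defect correlation
sigma = y.std()                           # population standard deviation (ddof=0)

error = y[R == 1].mean() - y.mean()       # actual error of the unweighted mean
identity = rho * np.sqrt((1.0 - f) / f) * sigma
print(error, identity)
```

The two printed quantities agree up to floating-point rounding, which shows why driving the correlation between inclusion and $y$ toward zero removes the bias regardless of how large the dataset is.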

Supplemental Material

Supplemental material, sj-docx-1-jof-10.1177_0282423X241298243 for An Empirical Comparison of Methods to Produce Business Statistics Using Non-Probability Data by Lyndon Ang, Robert Clark, Bronwyn Loong and Anders Holmberg in Journal of Official Statistics

Footnotes

Acknowledgements

The authors are grateful to three anonymous referees and an associate editor for their constructive comments, which have improved this article greatly. The authors would like to thank Dr. Siu-Ming Tam and Dr. Ryan Covey for their constructive comments on an early draft of this manuscript. The first author benefited from email exchanges with the authors of some of the methods investigated in this paper, including Prof. Yan Li and Prof. Pengfei Li.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author was supported by funding from the Sir Roland Wilson Foundation and the Australian Bureau of Statistics.

Disclaimer

The views expressed in this paper are those of the authors and do not necessarily represent the views of the Australian Bureau of Statistics. Where quoted or used, they should be attributed clearly to the authors.

ORCID iD

Lyndon Ang

Received: May 2024

Accepted: October 2024

References

Andridge, R. R.; West, B. T.; Little, R. J. A.; Boonstra, P. S.; Alvarado-Leiton. 2019. “Indices of Non-Ignorable Selection Bias for Proportions Estimated from Non-Probability Samples.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 68 (5): 1465–83. DOI: https://doi.org/10.1111/rssc.12371.

Australian Bureau of Statistics. 2022b. “Monthly Business Turnover Indicator Methodology.” October. https://www.abs.gov.au/methodologies/monthly-business-turnover-indicator-methodology/oct-2022 (accessed December 22, 2022).

Beaumont, J.-F. 2020. “Are Probability Surveys Bound to Disappear for the Production of Official Statistics?” Survey Methodology 46 (1): 1–28. Paper available at: https://www150.statcan.gc.ca/n1/pub/12-001-x/2020001/article/00001-eng.htm

Bethel. 1989. “Sample Allocation in Multivariate Surveys.” Survey Methodology 15 (1): 47–57. Paper available at: https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X198900114578

Boonstra, P. S.; Little, R. J.; West, B. T.; Andridge, R. R.; Alvarado-Leiton. 2021. “A Simulation Study of Diagnostics for Selection Bias.” Journal of Official Statistics 37 (3): 751–69. DOI: https://doi.org/10.2478/jos-2021-0033.

Breidt, F. J.; Opsomer, J. D. 2017. “Model-Assisted Survey Estimation with Modern Prediction Techniques.” Statistical Science 32 (2): 190–205. DOI: https://doi.org/10.1214/16-STS589.

Burakauskaitė; Čiginas. 2023. “An Approach to Integrating a Non-Probability Sample in the Population Census.” Mathematics 11 (8): 1782. DOI: https://doi.org/10.3390/math11081782.

Business Longitudinal Analysis Data Environment (BLADE). 2020. Microdata: Pay-As-You-Go. Dataset. ABS DataLab.

Castro-Martín; Rueda, M. d. M.; Ferri-García. 2022. “Combining Statistical Matching and Propensity Score Adjustment for Inference from Non-Probability Surveys.” Journal of Computational and Applied Mathematics 404: 113414. DOI: https://doi.org/10.1016/j.cam.2021.113414.

Chen; Haziza. 2023. “General Purpose Multiply Robust Data Integration Procedures for Handling Nonprobability Samples.” Scandinavian Journal of Statistics 50 (2): 697–724. DOI: https://doi.org/10.1111/sjos.12605.

Chen; Yang; Kim, J. K. 2022. “Nonparametric Mass Imputation for Data Integration.” Journal of Survey Statistics and Methodology 10 (1): 1–24. DOI: https://doi.org/10.1093/jssam/smaa036.

Chen. 2020. “Doubly Robust Inference with Nonprobability Survey Samples.” Journal of the American Statistical Association 115 (532): 2011–21. DOI: https://doi.org/10.1080/01621459.2019.1677241.

Chen. 2023. “Dealing with Undercoverage for Non-Probability Survey Samples.” Survey Methodology 49 (2): 497–515. Paper available at: https://www150.statcan.gc.ca/n1/pub/12-001-x/2023002/article/00005-eng.htm

Chipperfield; Chessman; Lim. 2012. “Combining Household Surveys Using Mass Imputation to Estimate Population Totals.” Australian & New Zealand Journal of Statistics 54 (2): 223–38. DOI: https://doi.org/10.1111/j.1467-842X.2012.00666.x.

Chromy. 1987. “Design Optimization with Multiple Objectives.” Proceedings of the Section on Survey Research Methods. American Statistical Association.

Deville, J.-C.; Särndal, C.-E. 1992. “Calibration Estimators in Survey Sampling.” Journal of the American Statistical Association 87 (418): 376–82. DOI: https://doi.org/10.2307/2290268.

Elisson; Elvers. 2001. “Cut-Off Sampling and Estimation.” Proceedings of Statistics Canada Symposium 2001.

Elliott, M. R.; Valliant. 2017. “Inference for Nonprobability Samples.” Statistical Science 32 (2): 249–64. DOI: https://doi.org/10.1214/16-STS598.

Ferri-García; Rueda, M. d. M. 2020. “Propensity Score Adjustment Using Machine Learning Classification Algorithms to Control Selection Bias in Online Surveys.” PLoS One 15 (4): e0231500. DOI: https://doi.org/10.1371/journal.pone.0231500.

Golini; Righi. 2024. “Integrating Probability and Big Non-Probability Samples Data to Produce Official Statistics.” Statistical Methods & Applications 33: 555–80. DOI: https://doi.org/10.1007/s10260-023-00740-y.

Hartley, H. O. 1962. “Multiple Frame Surveys.” Proceedings of the Social Statistics Section. American Statistical Association.

Hartley, H. O. 1974. “Multiple Frame Methodology and Selected Applications.” Sankhya C 36: 99–118.

Haziza; Chauvet; Deville, J.-C. 2010. “Sampling and Estimation in the Presence of Cut-Off Sampling.” Australian & New Zealand Journal of Statistics 52 (3): 303–19. DOI: https://doi.org/10.1111/j.1467-842X.2010.00584.x.

Hidiroglou, M. A.; Lavallée. 2009. “Sampling and Estimation in Business Surveys.” In Handbook of Statistics: Design, Methods and Applications, edited by Pfeffermann and Rao, C. R., vol. 29A. Elsevier. DOI: https://doi.org/10.1016/S0169-7161(08)00017-5.

Kim, J. K.; Morikawa. 2023. “An Empirical Likelihood Approach to Reduce Selection Bias in Voluntary Samples.” Calcutta Statistical Association Bulletin 75 (1): 8–27. DOI: https://doi.org/10.1177/00080683231186488.

Kim, J. K.; Wang. 2019. “Sampling Techniques for Big Data Analysis.” International Statistical Review 87 (1): 177–91. DOI: https://doi.org/10.1111/insr.12290.

Kim, J.-K.; Tam, S.-M. 2021. “Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference.” International Statistical Review 89 (2): 382–401. DOI: https://doi.org/10.1111/insr.12434.

Liu, A.-C.; Scholtus; De Waal. 2023. “Correcting Selection Bias in Big Data by Pseudo-Weighting.” Journal of Survey Statistics and Methodology 11 (5): 1181–203. DOI: https://doi.org/10.1093/jssam/smac029.

Lohr; Rao, J. N. K. 2006. “Estimation in Multiple-Frame Surveys.” Journal of the American Statistical Association 101 (475): 1019–30. DOI: https://doi.org/10.1198/016214506000000195.

Lohr, S. L. 2011. “Alternative Survey Sample Designs: Sampling with Multiple Overlapping Frames.” Survey Methodology 37 (2): 197–213. Paper available at: https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X201100211608

Lohr, S. L. 2021. “Multiple-Frame Surveys for a Multiple-Data-Source World.” Survey Methodology 47 (2): 229–63. Paper available at: https://www150.statcan.gc.ca/n1/pub/12-001-x/2021002/article/00008-eng.htm

Lohr, S. L.; Raghunathan, T. E. 2017. “Combining Survey Data with Other Data Sources.” Statistical Science 32 (2): 293–312. DOI: https://doi.org/10.1214/16-STS584.

Marella. 2023. “Adjusting for Selection Bias in Nonprobability Samples by Empirical Likelihood Approach.” Journal of Official Statistics 39 (2): 151–72. DOI: https://doi.org/10.2478/jos-2023-0008.

Medous; Goga; Ruiz-Gazen; Beaumont, J.-F.; Dessertaine; Puech. 2023. “QR Prediction for Statistical Data Integration.” Survey Methodology 49 (2): 385–410. Paper available at: https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X202300200009

Meng, X.-L. 2018. “Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” The Annals of Applied Statistics 12 (2): 685–726. DOI: https://doi.org/10.1214/18-AOAS1161SF.

Meng, X.-L. 2022. “Comments on ‘Statistical Inference with Non-Probability Survey Samples’ – Miniaturizing Data Defect Correlation: A Versatile Strategy for Handling Non-Probability Samples.” Survey Methodology 48 (2): 339–60. Paper available at: https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00006-eng.htm

Neyman. 1934. “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection.” Journal of the Royal Statistical Society 97 (4): 558–625. DOI: https://doi.org/10.2307/2342192.

Rao, J. N. K. 2021. “On Making Valid Inferences by Integrating Data from Surveys and Other Sources.” Sankhya B 83 (1): 242–72. DOI: https://doi.org/10.1007/s13571-020-00227-w.

Rao, J. N. K.; Fuller. 2017. “Sample Survey Theory and Methods: Past, Present, and Future Directions.” Survey Methodology 43 (2): 145–60. Paper available at: https://www150.statcan.gc.ca/n1/pub/12-001-x/2017002/article/54888-eng.htm

Righi; Bianchi; Nurra; Rinaldi. 2019. “Integration of Survey Data and Big Data for Finite Population Inference in Official Statistics: Statistical Challenges and Practical Applications.” Statistica & Applicazioni 17 (2): 135–58. DOI: https://doi.org/10.26350/999999_000025.

Rivers. 2007. “Sampling for Web Surveys.” In Proceedings of the Joint Statistical Meetings 2007, Section on Survey Research Methods, Salt Lake City, UT, July 29–August 2.

Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92. DOI: https://doi.org/10.2307/2335739.

Rueda, M. d. M.; Pasadas-del-Amo; Rodríguez, B. C.; Castro-Martín; Ferri-García. 2023. “Enhancing Estimation Methods for Integrating Probability and Nonprobability Survey Samples with Machine-Learning Techniques. An Application to a Survey on the Impact of the COVID-19 Pandemic in Spain.” Biometrical Journal 65 (2): 2200035. DOI: https://doi.org/10.1002/bimj.202200035.

Salvatore. 2023. “Inference with Non-Probability Samples and Survey Data Integration: A Science Mapping Study.” Metron 81 (1): 83–107. DOI: https://doi.org/10.1007/s40300-023-00243-6.

Särndal, C.-E.; Swensson; Wretman. 1992. Model Assisted Survey Sampling. Springer-Verlag Publishing. DOI: https://doi.org/10.1007/978-1-4612-4378-6.

Savitsky, T. D.; Williams, M. R.; Gershunskaya; Beresovsky. 2023. “Methods for Combining Probability and Nonprobability Samples Under Unknown Overlaps.” Statistics in Transition New Series 24 (5): 1–34. DOI: https://doi.org/10.59170/stattrans-2023-061.

Statistics Canada. 2018. “Modernizing the National Statistical System - Stakeholder Consultations.” July 26, 2018. https://www150.statcan.gc.ca/n1/pub/89-20-0003/892000032019001-eng.htm (accessed November 21, 2022).

Tam, S.-M.; Clarke. 2015. “Big Data, Official Statistics and Some Initiatives by the Australian Bureau of Statistics: Big Data and the ABS.” International Statistical Review 83 (3): 436–48. DOI: https://doi.org/10.1111/insr.12105.

Tam, S.-M.; Holmberg. 2020. “New Data Sources for Official Statistics – A Game Changer for Survey Statisticians?” The Survey Statistician 81: 21–35. DOI: https://isi-iass.org/home/wp-content/uploads/Survey_Statistician_2020_January_N81_02.pdf

Tillé; Debusschere; Luomaranta; et al. 2022. “Some Thoughts on Official Statistics and Its Future (with Discussion).” Journal of Official Statistics 38 (2): 557–98. DOI: https://doi.org/10.2478/jos-2022-0026.

Wang; Valliant. 2021. “Adjusted Logistic Propensity Weighting Methods for Population Inference Using Nonprobability Volunteer-Based Epidemiologic Cohorts.” Statistics in Medicine 40 (24): 5237–50. DOI: https://doi.org/10.1002/sim.9122.

Wright, R. L. 1983. “Finite Population Sampling with Multivariate Auxiliary Information.” Journal of the American Statistical Association 78 (384): 879–84. DOI: https://doi.org/10.1080/01621459.1983.10477035.

2022. “Statistical Inference with Non-Probability Survey Samples.” Survey Methodology 48 (2): 283–311. Paper available at: https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm

Yang; Kim, J. K. 2020. “Asymptotic Theory and Inference of Predictive Mean Matching Imputation Using a Superpopulation Model Framework.” Scandinavian Journal of Statistics 47 (3): 839–61. DOI: https://doi.org/10.1111/sjos.12429.

Yang; Kim, J.-K.; Hwang. 2021. “Integration of Data from Probability Surveys and Big Found Data for Finite Population Inference Using Mass Imputation.” Survey Methodology 47 (1): 29–58. Paper available at: https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm

Yorgason; Bridgman; Cheng; et al. 2011. “Cutoff Sampling in Federal Surveys: An Inter-Agency Review.” Proceedings of the American Statistical Association, Section on Government Statistics. https://www.bls.gov/osmr/research-papers/2011/st110050.htm.

Zhang, L.-C. 2019. “On Valid Descriptive Inference from Non-Probability Sample.” Statistical Theory and Related Fields 3 (2): 103–13. DOI: https://doi.org/10.1080/24754269.2019.1666241.

Zhang, L.-C. 2021. “Proxy Expenditure Weights for Consumer Price Index: Audit Sampling Inference for Big-Data Statistics.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 184 (2): 571–88. DOI: https://doi.org/10.1111/rssa.12632.

Zhang, L.-C. 2023. “Audit Sampling as a Quality Standard for Multisource Official Statistics.” Spanish Journal of Statistics 5 (1): 67–83. DOI: https://doi.org/10.37830/SJS.2023.1.05.

Zhu; Gamble, L. J.; Klapman; Xue; Lesser, V. M. 2023. “Using Auxiliary Information in Probability Survey Data to Improve Pseudo-Weighting in Nonprobability Samples: A Copula Model Approach.” Journal of Survey Statistics and Methodology. Published electronically September 12 2023. DOI: https://doi.org/10.1093/jssam/smad032.
