Abstract
Data from online non-probability samples are often analyzed as if they were based on a simple random sample drawn from the general population. As the exact sampling frames for these non-probability samples are usually unknown, there is no general method to construct unbiased estimators. This raises the question of whether estimates based on online non-probability samples are consistent across sample vendors and with estimates based on probability samples. To address this question, we analyze data collected from eight different online non-probability sample vendors and one online probability-based sample. We find that estimates from the different non-probability samples can be very inconsistent. We suggest averaging estimates across multiple vendor samples to avoid the risk of a maximum estimation error. We evaluate several averaging approaches, including a LASSO regression procedure which identifies a subset of vendors that, when averaged, produce estimates that are more consistent with the reference probability-based estimates than those of any single vendor. Our results show that estimates based on different vendors’ samples display different selection biases, but there is also some commonality among some vendor-specific estimates; thus there could be strong gains in estimation precision from averaging across a selection of multiple non-probability sample vendors.
1. Introduction
The rise of the internet created a plethora of opportunities to collect social survey data online. Web surveys now outpace all traditional “offline” survey modes, such as mail, telephone, and face-to-face (Callegaro et al. 2014). But unlike traditional survey modes, which have well-defined population frames, Web surveys face a major challenge in drawing probability-based samples that are representative of the general population. Although some Web surveys use offline sampling methods to recruit probability samples of the general population (Blom et al. 2016), these methods are costly and often exceed survey budgets. Therefore, out of necessity, many researchers and private companies instead draw non-probability samples from large pools of individuals who voluntarily sign up to become members of an online panel.
The difficulty of performing statistical inference based on non-probability samples is relevant for academic research, where analyses based on non-probability Web samples are increasingly making their way into the scientific literature (e.g., Alcántara et al. 2017; Bernauer and Gampfer 2015; Dockrell et al. 2013; Gathergood and Wylie 2018; Stancu et al. 2016), with little guarantee that results derived from one Web survey can be replicated in another. The use of non-probability samples is often due to the prohibitive cost of collecting a traditional probability sample. It is therefore important for researchers to understand the extent to which an arbitrarily chosen Web survey vendor produces similar (or different) errors compared to another vendor. If there is variability across non-probability sample vendors with respect to the errors they produce, researchers could potentially exploit this variability and improve their estimates by purchasing multiple samples from different vendors and averaging the resulting estimates, minimizing the risk of basing their inference on a single, potentially very selective sample.
In the present study, we address these issues by examining the variability in estimates obtained from several non-probability Web surveys conducted by multiple vendors in parallel, and demonstrate the advantages of averaging survey estimates across the different vendors versus simply commissioning a sample from a single vendor. In addition, we apply a LASSO procedure to investigate whether a more optimal subset of vendors can be identified for averaging that improves upon a simple “take-all” averaging approach. Success of the LASSO procedure would suggest that, by combining estimates from different vendor samples into a single more accurate estimate, the estimation errors associated with the individual non-probability samples can partially cancel one another out, yielding an estimation error closer to that of a probability sample from the target population.
1.1. Background/Research Questions
Most non-probability opt-in panels exclusively use online methods to recruit panelists; thus, samples drawn from them will automatically omit non-internet users as well as internet users who did not join the panel. Given that these segments of the general population are excluded from selection, there is a risk that samples drawn from opt-in panels result in biased estimates. Indeed, comparison studies tend to show that probability-based surveys produce more accurate estimates than non-probability surveys across a variety of outcome variables (Cornesse et al. 2020). Moreover, comparisons between online non-probability surveys fielded by different panel vendors suggest that there is significant variability in the estimates obtained from them (Blom, Ackermann-Piek, et al. 2017; Kennedy et al. 2016; MacInnis et al. 2018; Yeager et al. 2011).
These inconsistent results suggest that researchers may not obtain similar results from different opt-in panels. This poses a challenge for researchers interested in using online non-probability surveys to produce reliable and reproducible estimates of the general population. If the error properties of non-probability surveys are constant across panel vendors, then researchers can be confident that they will obtain similar results regardless of which vendor they choose. However, if the error properties of non-probability samples differ widely across sample providers—which the empirical literature suggests to be the case—then relying on a single arbitrary panel provider is a risky proposition as the purchased vendor sample may lead to larger estimation errors than those provided by alternative vendors. It may be more prudent for the researcher to allocate portions of their planned sample across multiple vendors, accounting for variability across different vendor samples and avoiding inadvertently basing their entire analysis on a single vendor that yields sample estimates with the highest error.
In practice, researchers typically contract only one panel vendor to carry out their Web survey. However, from a minimax perspective (Savage 1951), the alternative strategy of contracting multiple vendors to conduct the survey is more advisable, as the researcher minimizes their “maximum regret” by avoiding inadvertently contracting the vendor whose sample is associated with the largest estimation error. Minimax estimators have a long history in statistics, with Wald developing an early example (Wald 1939); Gabler et al. (2000), Hodges and Lehmann (2011), and Inada (1984), among many others, continued to develop minimax estimators.
The estimates produced from each vendor sample could then be averaged, giving equal weight to each sample, to produce an overall estimate that has a smaller estimation error than the sample with the largest possible error. If there is reason to believe that a set of vendor samples might have better data quality compared to the others, then a weighted average could be performed, giving higher weight to those vendors. This is the approach taken by poll aggregators, such as FiveThirtyEight, who perform a weighted average across multiple polls giving higher weight to the higher-rated polls (Silver 2019). However, typically there is very little (if any) information available that might distinguish between levels of data quality for different panel vendors with respect to a specific outcome variable. Identifying reliable quality indicators for non-probability samples is an open area of research.
In addition to averaging estimates across a large set of arbitrary vendor samples, a key question is whether a smaller set of vendors exists that would improve the accuracy of the averaged estimates even further. To address this question, we consider a regression analysis method, specifically the least absolute shrinkage and selection operator (LASSO). This method is used to determine whether some non-probability sample vendors provide redundant information through common error sources, while other vendors provide distinct information that can be exploited to enhance the estimation accuracy for a given substantive outcome.
Our study aims to address the following research questions:
1. Do non-probability samples furnished by different panel vendors produce common errors, resulting in consistent survey estimates with similar accuracy? In other words, to what extent can researchers be confident that a non-probability survey conducted by one, arbitrarily chosen, vendor will produce estimates with similar accuracy compared to a different vendor?
2. Is it beneficial from an error perspective to exploit between-vendor variation by averaging survey estimates across multiple vendor samples?
3. Does there exist an optimal and parsimonious subset of vendor samples that further minimizes error in the averaged survey estimates, relative to averaging across all available vendor samples? If so, future work could explore whether these patterns are consistent over time, enabling researchers to better optimize their selection of vendors.
To address these questions, the article proceeds as follows. First, we introduce our methodological framework for inference based on probability and non-probability samples (Subsection 2.1). After this, we discuss averaging estimators across different vendors (Subsection 2.2) and extend this idea by applying a data-driven LASSO-based procedure to investigate whether a more optimal and parsimonious subset of vendor samples exists that improves the accuracy of the averaged estimates (Subsection 2.3). Note that we do not examine methods which use a reference probability sample to predict the probability of being included in a non-probability sample (as in Elliott and Valliant (2017)), as our focus in this paper is on the differences between non-probability samples. In Section 3 we describe the data used to carry out our study. We then discuss the results of our analysis by assessing the extent to which non-probability samples furnished by different vendors produce consistent estimates for several substantive models of interest, and demonstrate the benefits of averaging the estimates across multiple vendor samples in order to minimize the risk of a worst-case error (Subsection 4.1). In Subsection 4.2 we show the results of our LASSO-based averaging estimators. Finally, in Section 5, we discuss our proposed methods, including their implications for collecting data using non-probability sample surveys.
2. Methods
2.1. Theoretical Framework
In this section we introduce some formal definitions and notation used to describe probability and non-probability sample surveys. A probability sample
It should be noted that in practice
We define a non-probability sample
To compare a survey with a probability sample to one with a non-probability sample, we assume that in both cases the same measurement instrument is applied, that is, both surveys measure the same variables for the elements in their respective samples. Now let
We are interested in the following finite population regression model:
where
The lack of information on
where
The existing literature on treating informative sampling uses auxiliary variables that describe the sampling (Pfeffermann 2011). While access to full design information and the estimation of response propensities for probability samples are in practice only possible for survey managers, such information can be passed on to analysts via the inclusion of survey weights, stratification, and clustering variables in the dataset. The absence of a known sampling frame, the reliance on willing respondents to self-select into the online access panel, and the general inability to model the recruitment process makes it impractical for non-probability surveys to supply analysts with the same information to account for informative sampling as probability-based surveys are able to. Often the most non-probability surveys can do is to provide some post-stratification or calibration weights (Särndal and Lundström 2005), which are used to force representativeness of the weighted sample distribution toward the margins of a reference population for some limited set of covariates, for example, categories of age, gender, geographical region. Whether these weights are suitable or sufficient to account for informative sampling cannot be tested in the absence of any reference data, such as from a population register or probability sample with a known sampling design.
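As an illustration of what such calibration weighting involves, the following minimal R sketch (using the survey package) rakes a hypothetical opt-in sample to population margins for age group and gender. All data and margin totals are simulated placeholders; this is not the weighting scheme of any vendor studied here.

library(survey)

## Hypothetical opt-in sample; y is some outcome of interest.
set.seed(1)
optin <- data.frame(
  agegrp = sample(c("18-39", "40-59", "60-70"), 1000, replace = TRUE),
  gender = sample(c("m", "f"), 1000, replace = TRUE),
  y      = rnorm(1000)
)
optin$w <- 1  # start from equal weights, as no inclusion probabilities exist

des <- svydesign(ids = ~1, weights = ~w, data = optin)

## Assumed population margins (e.g., from official statistics).
pop.age    <- data.frame(agegrp = c("18-39", "40-59", "60-70"),
                         Freq   = c(25e6, 20e6, 8e6))
pop.gender <- data.frame(gender = c("m", "f"), Freq = c(26e6, 27e6))

raked <- rake(des, sample.margins = list(~agegrp, ~gender),
              population.margins = list(pop.age, pop.gender))

svymean(~y, raked)  # calibrated estimate of the mean of y

Whether such weights remove informative selection cannot be verified from the opt-in sample alone, which is precisely the limitation discussed above.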
Suppose we would like to estimate a statistic
If we have access to a probability sample, with an unbiased estimation strategy
as the selection bias of the sampling design
Given our restriction on the non-probability surveys under consideration (that they measure the same variables as a probability survey), in combination with the limited control that analysts usually have over
2.2. Averaging Model Estimates Across Multiple Non-Probability Surveys
We primarily consider the situation where researchers are interested in making model-based predictions for certain outcome variables, using the estimated regression coefficients to make predictions. If, as expected, the different non-probability datasets result in different estimation errors (e.g., in sign and magnitude), then given our ignorance as to which dataset is best, the same regression model can be fit on each dataset and the resulting coefficients can be averaged. This prevents results with unusually high estimation errors from producing extremely wrong predictions, while taking advantage of datasets that make different errors. Datasets that make opposite errors can help cancel out each other's errors, resulting in more accurate predictions.
As a simple numerical example, let us consider individual-level predictions made by two models fit on separate non-probability datasets, with the first model predicting
The aim is to minimize the estimates’ maximum (worst-case) distance from the true value while being ignorant of which dataset(s) is (are) best. Making a prediction based on a single dataset while ignorant of quality has a chance of producing a worst-case situation, as the researcher could end up picking the worst dataset by chance. Averaging never performs as poorly as the worst dataset (unless all datasets are equally good or bad). Conducting a weighted average based on a priori information (e.g., costs, reputation, or even sample size) will always have a chance of inducing more distance from the true value, as extra weight could be inadvertently given to the worst dataset(s).
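To make this minimax logic concrete, here is a small R sketch that simulates vendor-specific biases (unknown to the analyst in practice) and compares the worst-case error of relying on a single vendor with the error of the equal-weight average. All numbers are invented for illustration.

set.seed(2)
true_beta   <- 0.5
vendor_bias <- c(0.30, -0.25, 0.10, -0.05)  # unknown selection biases

## Coefficient estimates from the same model fit on four vendor samples.
estimates <- true_beta + vendor_bias + rnorm(4, sd = 0.02)

max(abs(estimates - true_beta))   # worst case when picking one vendor blindly
abs(mean(estimates) - true_beta)  # error of the equal-weight average

Because the vendor biases partially offset one another, the equal-weight average here lands closer to the true value than the worst (and often the typical) single-vendor pick.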
Next, we provide a formal description of the estimation problem and briefly discuss model parameter averaging. Suppose there is a set of
For any given combination of vendors the following estimator can be computed:
where
2.3. LASSO Averaging
To make an informed selection of vendors, reference data are needed. A probability sample
where
One way to solve the problem in Equation (5) is to express it as a regression of
To solve the optimization problem in Equation (5), LASSO regression is used: a modified version of ordinary least squares (OLS) regression that attempts to select a set of coefficients
where
LASSO is an appropriate selection method (instead of OLS) due to its ability to determine which datasets are similar enough that they drop out, that is, have a coefficient of zero.
As LASSO drops variables that do not improve model fit (depending on a tuning parameter,
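A minimal R sketch of this selection step, using the glmnet package, is given below. Each column of X stands in for one vendor's model-based predictions for the units of a reference probability sample, and y for the outcome observed in that sample; the data-generating process and dimensions are invented for illustration and do not reproduce the paper's exact specification.

library(glmnet)

set.seed(3)
n <- 500; n_vendors <- 8
y <- rnorm(n)

## Vendors 1-3 track y with different biases; the others are mostly noise.
X <- sapply(1:n_vendors, function(k) {
  if (k <= 3) y + rnorm(n, mean = 0.2 * k, sd = 0.5) else rnorm(n)
})

## alpha = 1 gives the LASSO; cross-validation picks the penalty lambda.
cv <- cv.glmnet(X, y, alpha = 1)
coef(cv, s = "lambda.min")  # vendors shrunk to zero drop out of the average

Vendors whose predictions carry no information beyond that of the retained vendors receive a coefficient of zero and are excluded from the informed average.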
2.4. Averaging Estimators
Here we consider weighted and unweighted averages of estimated parameters to estimate the model of interest. For this we define a general class of estimators that average parameter estimates from different non-probability samples:
where
where
The first estimator,
The other two estimators use additional information from a probability sample to weight (or alternatively, select) estimates from the non-probability samples. The weights of the estimator
It is important to note that we use these latter two estimators only to show whether there exists a smaller subset of vendor samples which, when averaged, reduces estimation error further than simply averaging across all vendor samples. In practice, these estimators would not be feasible given the usual absence of a reference probability sample.
Our main interest lies in the distance between the prediction errors of the estimated models of interest based on the probability and the non-probability samples. To evaluate these quality indicators, we compare the following measures across the estimates.
To assess the prediction precision of the estimator
We opt to use
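For concreteness, one common way of computing a root relative mean square error is sketched below in R. As the exact formula used in the paper appears in the omitted display above, this should be read as an assumed placeholder definition rather than the authors' precise measure.

## RRMSE of predictions relative to reference predictions (assumed form).
rrmse <- function(pred, ref) {
  sqrt(mean((pred - ref)^2)) / sqrt(mean(ref^2))
}

set.seed(4)
ref  <- rnorm(100, mean = 2)        # e.g., predictions based on the GIP
pred <- ref + rnorm(100, sd = 0.3)  # e.g., predictions from an averaged model
rrmse(pred, ref)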
3. Data
To assess the variability in vendor-specific sampling error and demonstrate the different averaging procedures in a real-world application, we use one probability and eight non-probability Web surveys which implemented identical questionnaires during overlapping time periods. The probability survey comes from the German Internet Panel (GIP), which is a population-based panel survey representative of the German general population. The GIP is an ongoing, household panel survey of adults (16–75 years) residing in Germany. It is conducted by the University of Mannheim. The GIP sample is selected based on a multi-stage stratified area probability design. First, a random sample of geographic districts in Germany is drawn. Then a random sample of households is drawn, and all age-eligible household members are invited to join the panel (Blom et al. 2015). Because GIP data are exclusively collected online, households without internet access and/or an internet-capable browsing device are provided with either or both to facilitate coverage of the offline population and access to the online questionnaires (Blom, Herzing, et al. 2017). The initial recruitment round was conducted in 2012 and yielded a cumulative panel registration rate of 18.5%. A second recruitment round was conducted in 2014 and yielded a cumulative panel registration rate of 21.0%. After joining the panel, panelists are invited every two months to log in and complete a Web survey of about twenty to twenty-five minutes containing a range of question modules on social, political, and economic issues. We use the questionnaire module administered in the March 2015 wave of the GIP, in which 68.7% of panelists (or 3,426 out of 4,989) completed the Web survey. Survey weights are also available for the GIP; these were computed by raking to population benchmarks on the following variables: marital status, household size, age, and education.
The eight non-probability Web surveys were commissioned as part of a separate methodological project assessing the accuracy and response quality of non-probability surveys conducted by the GIP team (Blom, Ackermann-Piek, et al. 2017; Cornesse and Blom 2023). Each non-probability survey was conducted by a different survey vendor in response to a call for proposals. The call stipulated that vendors recruit a sample consisting of approximately 1,000 respondents that are representative of the general population of Germany aged 18 to 70 years. No explicit instructions on how representativeness was to be achieved were provided in the call, and this decision was left to the vendors’ discretion. Out of seventeen responses to the call, seven vendors met the technical and budgetary criteria of the study and were accepted. Upon learning about the aims of the methodological project, an eighth vendor contacted the GIP team and volunteered to carry out one of the surveys free of charge. A list of the eight non-probability surveys is provided in Table 6.1 in the Data Section of the Supplemental Appendix, along with details of the recruitment, including quota sampling variables and costs. To preserve confidentiality, the eight non-probability surveys are referred to as Survey 1, Survey 2, and so on, ordered by total survey costs (least expensive to most expensive).
We consider ten substantive models of interest, which we index by G1, G2, …, G10.
To fit all ten models, we used the R glm function from the stats package (R Core Team 2024) for the non-probability samples and the svyglm function from the R survey package (Lumley 2020) for the probability sample. No survey weights were used when estimating from the non-probability samples, whereas the GIP survey weights were used for the probability sample.
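The following minimal R sketch reproduces this estimation setup on simulated placeholder data; variable names and the model specification are illustrative, not the study's actual code.

library(survey)

set.seed(5)
nonprob_sample <- data.frame(
  y   = rnorm(200),
  age = sample(18:70, 200, replace = TRUE),
  sex = sample(c("m", "f"), 200, replace = TRUE)
)
gip_sample <- data.frame(
  y             = rnorm(150),
  age           = sample(16:75, 150, replace = TRUE),
  sex           = sample(c("m", "f"), 150, replace = TRUE),
  raking_weight = runif(150, 0.5, 2)  # stand-in for the GIP raking weights
)

## Unweighted GLM on a non-probability sample.
fit_np <- glm(y ~ age + sex, data = nonprob_sample, family = gaussian())

## Design-based GLM on the probability sample, using its survey weights.
gip_design <- svydesign(ids = ~1, weights = ~raking_weight, data = gip_sample)
fit_p <- svyglm(y ~ age + sex, design = gip_design, family = gaussian())

cbind(nonprob = coef(fit_np), gip = coef(fit_p))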
4. Results
4.1. Comparison of Sample Quality and Simple Averaging
Table 1 shows the
The deviations that some non-probability sample estimates display with respect to the probability sample data hint at a selection bias from the estimation strategy based on the non-probability samples, as described in Equation (3). With regard to the first research question, we may conclude that researchers should not rely on a single sample from an arbitrarily selected vendor and that model estimates are not reproducible across the different vendors. Thus, there is a clear argument for averaging estimates from different vendors, to hedge against the maximum regret (or error) and exploit the variation across vendors. The interested reader can find in the Supplemental Appendix Tables A1 to A10, one for each model, containing the parameter estimates obtained from each of the non-probability samples, together with the parameter estimates from the GIP benchmark probability sample. There it can be explored, in more detail, how much the parameter estimates differ across the non-probability samples and from the reference GIP sample estimates.
Figure 1 shows the distribution of
Since summary statistics, like means, are also very important for many practitioners, we include Tables 6.2 and 6.3 in the Supplemental Appendix, which show the means of our variables of interest and covariates for all eight non-probability samples and the GIP. To evaluate how the means of our variables of interest differ between the non-probability samples and the probability sample, we compute the conditional mean of each variable of interest for each non-probability sample, using the model parameter estimates from the non-probability sample and the covariate means from the probability sample. We then compute the relative difference between the mean of each variable of interest in the probability sample and these conditional means. We perform this analysis twice: once using all of the covariates listed above and once using only age, sex, and education as covariates, to see the effect of a more restricted and conventional set of control variables.
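A minimal R sketch of this conditional-mean comparison follows, assuming a linear model for illustration; the data are simulated placeholders rather than the GIP or vendor samples.

set.seed(6)
gip <- data.frame(age = rnorm(300, 45, 15), educ = rnorm(300, 12, 3))
gip$y <- 1 + 0.02 * gip$age + 0.10 * gip$educ + rnorm(300)

np <- data.frame(age = rnorm(300, 38, 12), educ = rnorm(300, 14, 2))
np$y <- 1.4 + 0.01 * np$age + 0.12 * np$educ + rnorm(300)

## Model parameters estimated from the non-probability sample ...
fit_np <- lm(y ~ age + educ, data = np)

## ... evaluated at the probability sample's covariate means.
xbar      <- c(1, mean(gip$age), mean(gip$educ))
cond_mean <- sum(coef(fit_np) * xbar)

## Relative difference from the probability-sample mean of y.
(cond_mean - mean(gip$y)) / mean(gip$y)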
The results of this analysis can be found in the Supplemental Appendix in Tables 6.6 and 6.5 for the analysis using all covariates and the restricted set of covariates, respectively. As a comparison, Table 6.4 is also included, which shows the relative distance between the unconditional means of the non-probability samples and the probability sample. The results of the analysis are mixed. Out of the eighty means considered, only for forty-two is the relative difference for the unconditional means higher than that for the conditional means using all covariates. When we consider the restricted set of covariates, this number only increases to forty-three. Overall, we do not observe large changes in the relative differences if we decrease the number of covariates. We see that, across the variables of interest, the cumulative relative difference is higher for the conditional means than for the unconditional means for three out of the eight vendors. For the restricted set of covariates, this increases to four. These findings suggest that the non-probability sampling designs are informative, despite controlling for a range of socio-demographic covariates.
A priori there is no indication of which vendors to avoid. Given the apparent informative selection of the non-probability samples, we now turn to evaluating the four averaging estimators and to identifying whether a more informed selection of vendors exists that, when averaged, minimizes error relative to the reference GIP probability sample.
4.2. Comparison of Informed and Simple Averaging Estimators
To address the second research question, we show results obtained from several alternative averaging estimators, including the two LASSO averaging estimators
Figure 2 shows the RRMSE for all four estimators, that is,
Figure 2. Root Relative Mean Square Error.
Figure 2 shows that the various estimators (e.g., LASSO weights, LASSO select) perform as well as or better than the simple mean for around 50% of all possible combinations of vendors. The LASSO weighted estimator
Table 2 shows the ratios
Table 2. RRMSEs in Relation to Simple Average.
The LASSO selected estimator
The RRMSE results for the averaging estimators clearly show that the LASSO-informed selection and weighting of vendors can result in considerable improvements in estimation precision. Considering that the two LASSO estimators,
For the interested reader, we also evaluate the deviation of non-probability sample model coefficient estimates from the reference probability sample estimates. The results and discussion of this analysis can be found in Supplemental Appendix 6.4.
Table 3 shows, for each model, which non-probability samples are selected for the LASSO estimators
Table 3. LASSO Selection Frequency of Non-Probability Samples.
As can be seen in Figure 1, models G1, G4, G5, and G6 generally have lower RRMSEs when estimated on the various non-probability samples than models G2, G3, G7, G8, G9, and G10. This may account for the results observed in Table 3, where the LASSO procedure selects fewer vendors when estimating these models. If each non-probability dataset is similarly (in)accurate (having a consistent and/or low RRMSE), providing little additional information useful for prediction, LASSO would select fewer datasets in order to reduce the number of parameters in its final model.
The patterns of the row and column sums of Table 3 suggest two things. First, the row sums indicate that different vendors produce samples that have very different errors for each model, and that there are vendors that are, on average, less suitable for estimating the models. Otherwise, LASSO would have either (1) selected the same vendors (as their non-probability samples provide important information) or (2) selected competing vendors with roughly equal probability, as the non-probability samples would improve the predictive model similarly. An example of the second possibility can be seen in Table 3, where for model G2, vendors 2, 4, 5, 6, and 7 are selected equally often, suggesting that they are somewhat interchangeable. In contrast, vendors 1, 3, and 8 are always selected for model G2, as they provide information not contained in the other vendors’ samples. Second, the differences in the column sums show that the degree of selection bias in the non-probability samples differs depending on the outcome variable of interest.
To further investigate the selection bias of the eight sample vendors, we examine the weights of estimator
Table 4. LASSO Weights of Non-Probability Samples.
Table 4 shows that the weights can be both positive and negative, indicating that for some vendor samples there is a negative relationship between their measurement of the variable of interest and that of the GIP probability reference sample. We find a strong negative relationship between the probability sample and non-probability samples 4 and 5 for models G7 and G5, respectively. For all models except G1, there is a mix of negative and positive weights across the vendors, with the exception of vendor 1. Vendor 8 also has only positive weights, except for model G5, where its weight and selection frequency are comparatively low.
Now, to see the relationship between the inclusion of a non-probability sample and its weight, we plot the selection frequencies from Table 3 against the average weights from Table 4; the result is displayed in Figure 3 for each model separately.
Figure 3. LASSO weights and LASSO selection frequency.
There are two observations that can be made. First, negative weights with large absolute values only occur for models G2, G3, G7, and G10, where most vendor samples are selected for inclusion in the LASSO estimators
To conclude the presentation of the results, we examine the relationship between the RRMSE of the simple average
5. Discussion
This article addressed three primary research questions. First, we analyzed whether non-probability samples collected from different vendors produce similar or consistent errors and, if not, whether it is possible to reduce worst-case estimation error in substantive survey estimates by averaging across multiple vendor samples. We showed that we were not able to control for the selection process of the non-probability samples by simply conditioning on a range of socio-demographic covariates that are also commonly used as weighting variables in surveys.
We also showed that there is clearly a great deal of error variation from vendor to vendor, as well as from substantive model to substantive model, suggesting that relying on any single non-probability sample vendor is risky in terms of ensuring high-quality and reproducible parameter estimates.
We then addressed the second research question: How can the results obtained from the different non-probability samples be weighted to reduce estimation errors? To this end, we evaluated multiple averaging strategies for combining different non-probability samples to improve parameter estimates from substantive models. We showed that there is a clear benefit to simply averaging estimates across the full collection of vendor samples, as this reduces the maximum error (or regret) compared to using any single vendor sample. This is an important result given that it is both immediately usable by researchers and, judging by common practice, not widely appreciated. Most researchers rely on only one vendor sample for their analysis and do not consider spreading their sample across multiple vendors. It also suggests that when collecting non-probability samples, it is preferable to accept a smaller sample size per vendor and obtain samples from several different vendors, rather than a larger sample from a single vendor. The exact trade-off between sample size and the number of different vendors remains a subject for further research.
Finally, we demonstrated via LASSO weighting procedures that the accuracy of model estimates can be greatly improved by averaging a well-chosen subset of vendors. This is direct evidence that simply increasing the number of vendors used in the averaging procedure will not necessarily optimize estimation (despite increasing cost), and that some sample providers are more suitable than others for estimation depending on the outcome/model of interest; in fact, certain vendors did not provide any useful information during the estimation of some models and simply dropped out during the LASSO selection step (i.e., were assigned a weight of zero). In other words, some vendors produced samples that were redundant, making errors similar to those made by some other vendors. This is a very useful finding. If vendors make similar errors, then there is no advantage to commissioning samples from multiple vendors (averaging would only reduce independent sources of error, not common sources). As we found that some vendors produce samples suffering from different errors, while others result in similar errors, our study points to a potentially worthwhile subject of research: Are the errors produced by vendors consistent over time? If so, vendors could be evaluated, and researchers could then use these evaluations to more efficiently guide their selections of vendors, possibly reducing the number of vendors that would need to be commissioned.
From a practical point of view, a benchmark statistic with an estimable sampling error, obtained from a probability sample, is still required. However, there are Scientific Use Files of probability-sample-based surveys available that can serve this role, even if their data are not measured at the same time as those of the non-probability samples. Non-probability samples are in fact often weighted using estimates obtained from probability samples.
If a researcher cannot afford a probability sample and must use non-probability sampling, they should minimize their risk of error by distributing their planned sample across multiple vendors and averaging the point estimates obtained from each dataset, rather than contracting a single vendor sample. Further, when contracting multiple vendors, allocating a larger share of the total sample to one particular vendor (at the cost of allocating smaller sample shares to the other vendors) will usually not result in an improved RRMSE, as we do not know in advance which vendor sample will result in the lowest estimation error (i.e., cost is not a reliable indicator of quality). There would always be a risk that one's preference for a particular vendor would harm the accuracy of the estimated model. As increasing the sample size of a non-probability sample does not automatically result in greater power (Meng 2018), preferring vendors that can provide more observations at a given cost is similarly an unviable strategy. Also, since we lack a method to estimate sampling variances under a non-probability sampling design (within our framework), it is difficult to assess how different sample sizes of the non-probability samples would affect the standard error of estimates based on them. Thus it would also not be possible to prefer a particular vendor based on a design effect, for example.
In conclusion, this study finds significant variability in the errors produced by different non-probability sample vendors. Not only did the errors vary across samples, but also across substantive outcomes: a vendor sample that performed very well for estimating one outcome often performed worse for other outcomes. This context-dependent performance is consistent with differing selection effects for each vendor, and also with our hypothesis of informative sampling, as stated in Subsection 2.1. Simply averaging point estimates derived from multiple vendor samples is beneficial, as it hedges against the risk of a worst-case scenario with regard to prediction or estimation accuracy, provided we have no additional information suggesting that one vendor is likely to perform better than another. However, we showed that substantial improvements in accuracy can be achieved by selecting and averaging an optimally chosen subset of vendors for a given model of interest. It is also clear that some vendors produce samples with redundant (common) errors, while others exhibit beneficially different errors. An alternative to averaging could be the pooling of different vendor samples for estimation; however, the problem of selecting and/or weighting the different non-probability samples would remain the same. In addition, unequal sample sizes across the non-probability samples would pose the further problem of deciding how many and which elements from each sample should be used in the estimation.
This suggests serious shortcomings with non-probability samples, as their compositions seem to differ dramatically from each other. In the absence of further methodological developments (such as indicators of vendor quality), probability samples should be preferred whenever possible.
Acknowledgements
The authors gratefully acknowledge the support of the SFB 884, in particular projects A8 and Z1.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper used data from the German Internet Panel (GIP; wave 16). At the time of data collection the GIP was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the Collaborative Research Center (SFB) 884 “Political Economy of Reforms” (SFB 884, Project-ID 139943784).
Supplemental Material
Supplemental material for this article is available online.
Received: June 30, 2023
Accepted: December 19, 2024