
Abstract

Design-based inference from probability samples is valid by construction for target parameters that are descriptive summaries of finite populations. We develop a novel approach to design-based predictive inference for finite populations, where the individual-level predictor is learned from a probability sample using any models or algorithms for incorporating the relevant auxiliary information, and the uncertainty of estimation is evaluated with respect to the known probability design while the outcome and auxiliary values for modeling are treated as constants. Unlike the existing theory of design-based model-assisted estimation for finite populations, design-based predictive inference is equally suited to individual-level prediction and to population-level estimation.

Keywords

probability sampling, model-assisted estimation, sample split, Rao-Blackwellisation, administrative register, big data

1. Introduction

Throughout the twentieth century, design-based inference from finite-population probability sampling has been established as the standard approach to official statistics; see Hansen (1987), Smith (1994), Kalton (2002), Rao (2005, 2011), Beaumont and Haziza (2022) for reviews and appraisals. In this context the target parameters for estimation are descriptive, observable summaries of a given finite population, such as the population total, mean or quantiles of some specific values associated with the given population units, and the inference is characterized as descriptive or predictive (Smith, 1983; Geisser, 1993), in contrast to analytic inference of theoretical, unobservable targets such as the life expectancy (of a hypothetical cohort of individuals) or a parametric model that can be used to understand the given population.

By design-based inference from probability samples, the uncertainty of estimation is evaluated with respect to hypothetically repeated sampling from the same finite population, while all the other values involved are treated as constants associated with the given population. Design-based inference is valid by construction because it is based on the known sampling design, “whatever the unknown properties of the population” (Neyman 1934). In contrast, by model-based inference, the uncertainty of estimation is evaluated with respect to an assumed statistical model of the observations, while the available sample is typically treated as fixed; see for example, Valliant et al. (2000). Although models are necessary for analytic targets or if the available observations are not obtained by probability sampling, model-based inference may be invalid to the extent the assumed model is misspecified in respects that matter to the task at hand.

Design-based inference can be made more efficient by using auxiliary information in addition to the sampling design. This can largely recover the ‘loss of efficiency’ compared to model-based inference that uses the same auxiliary information optimally under the assumed model. For instance, calibration estimation (Deville and Särndal 1992) is a general approach that makes adjustments to the design weights with respect to the known auxiliary population totals. Or, empirical likelihood methods can yield confidence intervals that have better properties than normal approximation based on the central limit theorem (Hartley and Rao, 1968; Rao and Wu, 2010; Berger and De La Riva Torres, 2016). More relevant to our development is the model-assisted approach where an assisting model is explicitly formulated but inference remains design-based, whether or not the adopted estimator has optimal properties with respect to the assisting model. One can use linear models (Särndal et al. 1992), generalized linear models (Wu and Sitter 2001), or many other models under a unified “construction recipe” as reviewed by Breidt and Opsomer (2017).

To justify any model-assisted estimator that is not design-unbiased, it is common to seek a proof that it can be design consistent asymptotically for a hypothetical sequence of populations of increasing sizes. As Smith (1994) points out, this “asymptotic notion of consistency” is not immediately applicable to the given population as “a real entity”. In contrast, for a given population and sampling method, if $t (1), . . ., t (k)$ are unbiased estimators of the population totals $T (1), . . ., T (k)$ , then $g (t (1), . . ., t (k))$ is called “consistent” for $g (T (1), . . ., T (k))$ by Fisher (1956), in that replacing $t (j)$ by $T (j)$ gives the true target population parameter. Similarly, an interval estimator of “a collective character … of a population” is called “consistent” by Neyman (1934), if it achieves the designated level of coverage given the finite population and sampling method.

We emphasize that asymptotic consistency of a point estimator or an interval estimator is unnecessary if the estimator is Neyman-Fisher consistent, in the sense in which Neyman and Fisher used the term “consistent” for finite-population inference. This will be our perspective on design-based predictive inference in this paper, which may also be called fully design-based inference, in contrast to asymptotically design-consistent inference (which has traditionally been more common for model-assisted finite population estimation).

Now, we notice that design-unbiased ratio or linear regression estimators for population totals have been proposed by Hartley and Ross (1954) and Mickey (1959), which are finite-population Neyman-Fisher consistent. More recently, Sanguiao-Sande and Zhang (2021) developed a design-unbiased approach, called subsampling Rao-Blackwellisation (SRB), which allows for any assisting Machine Learning (ML) models or algorithms that have become increasingly common. The SRB approach combines three classic ideas in statistical inference, (i) model-assisted estimation for survey sampling, (ii) cross-validation for error estimation by ML methods, and (iii) the Rao-Blackwell Theorem (Rao, 1945; Blackwell, 1947) for efficiency improvement.

In this paper, we extend the SRB approach to a larger class of population estimators, which are commonly referred to as the prediction estimators, as well as the associated individual-level predictors for the out-of-sample units. Notice that, traditionally, due to the lack of a design-based prediction theory, individual outcomes must be treated as random variables for model-based prediction and the term predictor is common in this context. Although from a design-based inference perspective the term prediction estimator would seem more appropriate also at the unit level, we shall keep the term predictor at the individual level for convenience and familiarity reasons. Notice also that the two terms “unit” and “individual” are used interchangeably in this paper, such as in “statistical unit” or “individual prediction.”

It may be helpful to make some remarks immediately regarding the nature of our inference approach and its advantages compared to the more familiar model-assisted inference.

First, we consider predictive inference by definition, where the sample-based prediction estimator (using any given ML model or algorithm) aims at some out-of-sample quantity that varies with the sample, the property of which is evaluated only with respect to repeated sampling from the given population. This design-based predictive inference outlook differs from model-assisted estimation that is aimed at fixed population parameters (such as totals or means).

Next, we develop Neyman-Fisher consistent uncertainty estimators, which accommodate any given assisting model (or algorithm) and apply generally to sampling from finite populations. In contrast, asymptotically design-consistent inference may not hold in a given setting of finite population sampling, but still requires tailored asymptotic arguments to be developed for different nonparametric assisting models, such as random forest or support vector machine, which will remain a challenge as new models or algorithms emerge.

Finally, while our approach is model-assisted in the sense that the sampling design remains the inference basis despite the model introduced, it also provides a design-theoretical basis for individual-level prediction (estimation). This is another important difference from standard model-assisted estimation, which is only applicable to population parameter estimation.

1.1. Prediction Estimator

Denote by $U = {1, . . ., N}$ a given finite population that is of size $N$ . Let $y_{U} = {y_{i} : i \in U}$ be the associated values of interest. Denote by $x_{U} = {x_{i} : i \in U}$ the collection of feature vectors, where $x_{i}$ is the vector associated with each unit $i \in U$ . Given any sample of units from $U$ , denoted by $s \subset U$ , let $μ (x, s)$ be a predictor for any out-of-sample unit, say, $j \in R = U \ s$ , whose feature vector takes value $x$ , that is, $x_{j} = x$ . Notice that any function of ${y_{i} : i \in U \ s}$ is a random quantity that varies with the sample $s$ , just like $μ (x, s)$ itself as the notation emphasizes. When the target parameter is the population total $Y = \sum_{i \in U} y_{i}$ , the prediction estimator of $Y$ is given as

\hat{Y} = \sum_{i \in s} y_{i} + \sum_{j \in U \ s} μ (x_{j}, s)

(1)

We shall consider $μ (x, s)$ as the associated individual-level predictor of $y$ for any unit with given features $x$ , where $y$ is treated as a constant just like $x$ .

Notice that individual-level features are required to compute the prediction estimator, Equation (1), except, for example, in the case of linear models. However, this requirement is common to all individual-level prediction models, regardless of the inference framework. There may be situations where such information is unavailable, which would limit one’s choice of models. But the ability to utilize individual-level covariates is certainly not a limitation of our approach, not least because, as we shall explain, design-based individual-level predictive inference can provide a valid theoretical basis for producing census-like statistical data, which fills a gap in the current literature.

Note that many design-based estimators in survey sampling can also be written as prediction estimators of the form of Equation (1). For example, let $x_{i} = π_{i} N / n$ for any $i \in U$ , given $π_{i} = \Pr (i \in s)$ and $n = | s |$ , such that the Horvitz-Thompson (HT) estimator (Horvitz and Thompson 1952) can be given in the form of Equation (1) with

μ_{HT} (x_{j}, s) = x_{j} β_{s} + \frac{1}{N - n} \sum_{i \in s} (x_{i} β_{s} - y_{i})

where $β_{s} = n^{- 1} \sum_{i \in s} y_{i} / x_{i}$ . Other examples include the generalized regression estimator, model-calibrated linear estimator, as well as the SRB estimator of Sanguiao-Sande and Zhang (2021).
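As a numerical illustration of the identity above, the following Python sketch evaluates Equation (1) with $μ_{HT}$ and compares it with the usual HT total $\sum_{i \in s} y_{i} / π_{i}$ . It assumes a fixed-size design with $\sum_{i \in U} π_{i} = n$ ; the function name ht_as_prediction and the toy values of $y_{i}$ and $π_{i}$ are hypothetical and are not taken from the paper.

```python
def ht_as_prediction(y, pi, s):
    """Evaluate Equation (1) with mu_HT.

    y, pi : lists indexed by population unit (only y[i] for i in s is used)
    s     : list of sampled unit indices
    Assumes a fixed-size design, so that sum(pi) equals n = len(s).
    """
    N, n = len(pi), len(s)
    x = [p * N / n for p in pi]                    # x_i = pi_i * N / n
    beta = sum(y[i] / x[i] for i in s) / n         # beta_s
    # Constant added to every out-of-sample prediction
    shift = sum(x[i] * beta - y[i] for i in s) / (N - n)
    rest = [j for j in range(N) if j not in s]     # U \ s
    return sum(y[i] for i in s) + sum(x[j] * beta + shift for j in rest)


# Hypothetical toy population; the inclusion probabilities sum to n = 2
y = [2.0, 5.0, 3.0, 8.0, 6.0]
pi = [0.2, 0.5, 0.3, 0.6, 0.4]
s = [1, 3]

ht_total = sum(y[i] / pi[i] for i in s)            # classical HT estimator
pred_total = ht_as_prediction(y, pi, s)            # Equation (1) with mu_HT
```

The two totals agree up to floating-point error because the sample residuals cancel and $\sum_{i \in U} x_{i} = N$ under the fixed-size assumption.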

However, we are interested in design-unbiased inference of the prediction estimator, Equation (1), generally, including when $μ$ is given by an arbitrary ML method without regard to the sampling design and may not have a wieldy expression, such as a random forest trained on ${(y_{i}, x_{i}) : i \in s}$ by ready-made software.

The theory of design-based predictive inference for population totals and individuals will be developed and illustrated in Sections 2 and 3, respectively; an illustrative application to the Structural Business Survey is given in Section 4, and some final remarks are given in Section 5. However, before we get into the details of the development, let us first motivate below what design-based predictive inference can do for official statistics.

1.2. Introduction to Total Estimation

Design-based predictive inference is clearly relevant to the perennial design versus model controversy in survey sampling, these being traditionally the two main strands of approaches to finite population estimation.

In the design-based approach, estimators depend on the sampling design through the sample inclusion probabilities or other known sampling probabilities. Although auxiliary information in addition to the sampling design can be incorporated by various techniques, the validity and the associated uncertainty of the resulting estimator are still based on the given sampling design. In contrast, the model-based prediction approach, frequentist or Bayesian, depends on an assumed working model, which typically ignores the sampling design and treats the available sample as fixed when it comes to the assessment of the associated uncertainty.

Meanwhile, it is possible to evaluate any design-based estimator, such as the HT estimator or a generalized regression estimator, with respect to an assumed model, in which case it is common to conclude that design-based estimation is inefficient or lacks desirable conditional properties (e.g., Valliant et al. 2000). Conversely, a prediction estimator derived optimally under a working model can be evaluated with respect to the sampling design, in which case the danger of model misspecification is frequently noted (e.g., Hansen et al. 1983).

Design-based predictive inference takes the last analysis further, whereby one explicitly evaluates the design-based bias and variance of any given model-based prediction estimator, Equation (1). One can then compare the given model-based estimator to any other estimator, whether the latter is dependent on the design or a working model, and choose according to their design-based properties regardless of how they are constructed. The merit of such an approach rests now on the fact that design-based uncertainty assessment is valid, which does not require the model underlying any given estimator to be correct.

To illustrate, under the linear model $E_{M} (y_{i} | x_{i}) = x_{i}^{T} β$ and $V_{M} (y_{i} | x_{i}) = σ^{2}$ where $E_{M}$ and $V_{M}$ denote expectation and variance under the model, the best linear unbiased predictor (BLUP) of a population total $Y = \sum_{i \in U} y_{i}$ is

\hat{Y} = X^{T} b and b = (\sum_{i \in s} x_{i} x_{i}^{T})^{- 1} \sum_{i \in s} x_{i} y_{i}

where $X = \sum_{i \in U} x_{i}$ . Given $π_{i} = \Pr (i \in s)$ , we have

E_{p} (\hat{Y}) = X^{T} E_{p} (b) = Y - \sum_{i \in U} {y_{i} - x_{i}^{T} E_{p} (b)}

where $E_{p} (b) \approx (\sum_{i \in U} π_{i} x_{i} x_{i}^{T})^{- 1} \sum_{i \in U} π_{i} x_{i} y_{i}$ with respect to sampling, and

V_{p} (\hat{Y}) = X^{T} V_{p} (b | s) X \approx X^{T} E_{p} {V_{M} (b | s)} X

with respect to sampling, where

E_{p} {V_{M} (b | s)} = E_{p} {σ^{2} {(\sum_{i \in s} x_{i} x_{i}^{T})}^{- 1}} \approx (\frac{1}{N} \sum_{i \in U} {y_{i} - x_{i}^{T} E_{p} (b)}^{2}) {(\sum_{i \in U} π_{i} x_{i} x_{i}^{T})}^{- 1} .

Both $E_{p} (\hat{Y})$ and $V_{p} (\hat{Y})$ can then be estimated from $s$ and compared to, say, the approximate variance of a generalized regression estimator of $Y$ .

Thus, a chief advantage of adopting design-based predictive inference to population total estimation based on probability sampling is to circumvent the design versus model controversy, by providing a valid common ground for uncertainty assessment. A theory applicable to the class of prediction estimators, Equation (1), which will be developed in this paper, would allow one to use any assisting ML models or algorithms that can often make more efficient use of auxiliary information than the standard design-based calibration estimation or model-assisted estimation methods.

1.3. Introduction to Individual Estimation

Individual estimation concerns results at the most disaggregated level possible. It can be useful for constructing statistical registers or census-like statistical data as the basis for descriptive official statistics. However, there has never been a design-based theory for estimation at the individual level.

For instance, having taken a simple random sample of all but two units in a given population, the traditional design-based estimation theory would only allow one to make inference about the total (or mean) of the two out-of-sample units, but not each on its own, no matter how large the sample is or how much auxiliary information one has in addition. This is clearly unsatisfactory and calls for an extension of the design-based inference theory.

To illustrate the conceptual issue at hand, suppose on observing ${y_{i} : i \in s}$ in a subset $s$ of the population $U$ , one would like to predict the $y$ -value for each unit out of $s$ by $μ (s) = \sum_{i \in s} y_{i} / n$ , where $n$ is the number of units in $s$ . How can one infer about the loss $D_{s} = \sum_{j \in U \ s} {μ (s) - y_{j}}^{2}$ that is unobserved?

One possibility is to assume a model. For instance, under the model that $y_{i}$ is independent and identically distributed (IID) for any $i \in U$ , we have

E_{M} (D_{s} | s) = (N - n) (1 + n^{- 1}) σ^{2}

with respect to the IID model conditional on the given subset $s$ , where $σ^{2}$ is the variance of $y_{i}$ under the model and $N$ is the number of units in $U$ .

However, we notice that a fundamentally different, design-based approach would in fact be possible if $s$ is selected from $U$ by a known sampling design, denoted by $s ~ p (s)$ , where $\sum_{s \in Ω} p (s) = 1$ and $Ω$ contains all the possible samples from $U$ . For instance, suppose $s$ is selected from $U$ by simple random sampling without replacement (SRSWOR), where $p (s) = 1 / (\begin{matrix} N \\ n \end{matrix})$ , such that

E_{p} (D_{s}) = (N - n) (1 + n^{- 1}) S_{y}^{2}

with respect to $p (s)$ , where $S_{y}^{2} = \sum_{i \in U} (y_{i} - \bar{Y})^{2} / (N - 1)$ and $\bar{Y} = \sum_{i \in U} y_{i} / N$ .

Since $s_{y}^{2} = \sum_{i \in s} {y_{i} - μ (s)}^{2} / (n - 1)$ is both unbiased for $σ^{2}$ under the IID model and unbiased for $S_{y}^{2}$ under SRSWOR, numerically one would obtain the same estimate of the expected loss, even though the two have completely different interpretations. While an assumed model would be necessary for evaluating $E_{M} (D_{s} | s)$ if the selection mechanism of $s$ is unknown, it could be invalid if the distribution of the observed data actually differs from that of the unobserved ones. In contrast, while the design-based loss $E_{p} (D_{s})$ here requires one to plan and implement the SRSWOR design, it is necessarily valid because $p (s)$ is known.
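The design-based expectation $E_{p} (D_{s}) = (N - n) (1 + n^{- 1}) S_{y}^{2}$ can be checked by brute force on a small population. The following Python sketch averages $D_{s}$ over all $C (N, n)$ equally likely SRSWOR samples and compares the result with the closed form; the function name expected_loss_srswor and the toy $y$ -values are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from statistics import mean, variance


def expected_loss_srswor(y, n):
    """Average D_s over all C(N, n) equally likely SRSWOR samples s."""
    N = len(y)
    losses = []
    for s in combinations(range(N), n):
        mu = mean(y[i] for i in s)                 # mu(s): the sample mean
        # D_s: squared prediction error summed over the out-of-sample units
        losses.append(sum((mu - y[j]) ** 2 for j in range(N) if j not in s))
    return mean(losses)


# Hypothetical toy population, treated as fixed constants
y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]
N, n = len(y), 3

enumerated = expected_loss_srswor(y, n)
# statistics.variance divides by N - 1, which matches the definition of S_y^2
closed_form = (N - n) * (1 + 1 / n) * variance(y)
```

No model for $y_{i}$ enters the calculation: the only randomness is the choice of $s$ , which is why the agreement is exact up to floating-point error.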

We shall develop a general design-based theory for the out-of-sample loss, provided the observations are obtained by probability sampling. This would yield valid inference of the associated risk with respect to the given sampling design, where all the outcomes $y_{U}$ and features $x_{U}$ are treated as constants.

2. Total Prediction Estimator

Consider the prediction estimator, Equation (1), where $μ (x, s)$ can be obtained by any model or algorithm fitted to the full sample $s$ . Now, hypothetically speaking, it is clear that design-unbiased estimation of $\hat{Y} - Y = \sum_{j \in R} {μ (x_{j}, s) - y_{j}}$ , or some function of it, would be possible given an additional probability sample $r$ selected from $U \ s$ , because one can then observe the error $e_{j} = μ (x_{j}, s) - y_{j}$ for any $j \in r$ . In the absence of extra observations ${y_{j} : j \in r}$ , valid design-based inference requires creating observed errors within the sample $s$ .

Denote by $s_{1} \cup s_{2} = s$ and $s_{1} \cap s_{2} = \emptyset$ a training-test sample split, where $s_{1}$ is selected by a subsampling design, denoted by $q (s_{1} | s)$ . For example, $s_{1}$ of size $n_{1}$ can be randomly sampled from $s$ with or without replacement. Or, in $T$ -fold cross-validation, $s$ is first randomly partitioned into $T$ clusters and then each cluster is selected as $s_{2}$ one by one systematically, yielding $s_{1} = s \ s_{2}$ accordingly. Denote by $μ (x, s_{1})$ the predictor obtained from the subsample $s_{1}$ , in the same way as $μ (x, s)$ from $s$ . Its error $μ (x_{j}, s_{1}) - y_{j}$ can be observed for any $j \in s_{2}$ .

As in Sanguiao-Sande and Zhang (2021), we shall refer to the sampling design that yields $(s_{1}, s)$ as the $pq$ -design, denoted by

f (s_{1}, s) = q (s_{1} | s) p (s) = f (s | s_{1}) f (s_{1})

(2)

where the last product indicates that, conditional on the training set $s_{1}$ , one can view the test set $s_{2}$ as a probability sample from $U \ s_{1}$ , according to which $s$ can vary under the $pq$ -design. In particular, for any $i \in U$ , let

π_{2 i} = \Pr (i \in s_{2} | s_{1}) = \sum_{s \ni i, i \notin s_{1}} f (s | s_{1})

(3)

be its conditional $s_{2}$ -inclusion probability given $s_{1}$ under the $pq$ -design.

The theory of design-based predictive inference we develop below, including both Theorems 1 and 2, applies generally to any well-defined $pq$ -design, Equation (2). However, the typical $q$ -designs mentioned above may need to be modified in practice, in order to accommodate the additional complex sampling features that may exist. For instance, given a stratified $p$ -design of $s$ , it may be natural to subsample $s_{1}$ within each stratum as well. Or, given a multistage $p$ -design, the $q$ -design must involve subsampling of the selected primary sampling units (PSUs), instead of only subsampling elements within all the selected PSUs, such that conditional sampling by $f (s | s_{1})$ covers the whole population.

Now, conditional on any given $s_{1}$ , the design-based bias and mean squared error (MSE) of the prediction estimator using $μ (x, s_{1})$ can be easily derived. The theory below explains how one can infer the design-based bias and MSE of the prediction estimator (1) that is trained using all the observations in $s$ , with respect to $s ~ p (s)$ , by appropriate averaging over the $q$ -design.

2.1. SRB Prediction Estimator

Given $s_{1}$ under the $pq$ -design, the subsample-trained prediction estimator is

{\hat{Y}}_{1}^{*} = \sum_{i \in s} y_{i} + \sum_{j \in U \ s} μ (x_{j}, s_{1})

whose total error for $Y$ is given by

B = \sum_{j \in U \ s} e_{1 j} = B_{1} - B (s_{2})

where $e_{1 j} = μ (x_{j}, s_{1}) - y_{j}$ for any $j \notin s_{1}$ , and

B_{1} = \sum_{j \in U \ s_{1}} e_{1 j} and B (s_{2}) = \sum_{j \in s_{2}} e_{1 j}

Note that $B$ and $B (s_{2})$ vary with $(s, s_{2})$ conditional on $s_{1}$ while $B_{1}$ is fixed.

Applying Rao-Blackwellisation to ${\hat{Y}}_{1}^{*}$ yields a corresponding SRB prediction estimator in the form of Equation (1), which is given by

{\hat{Y}}^{RB} = \sum_{i \in s} y_{i} + \sum_{j \in U \ s} \bar{μ} (x_{j}, s)

(4)

where

\bar{μ} (x_{j}, s) = E_{q} {μ (x_{j}, s_{1}) | s}

(5)

Notice that $\bar{μ} (x, s)$ is a particular full-sample trained predictor, and its special notation $\bar{μ}$ is introduced to distinguish it from any $μ (x, s)$ that uses the same $μ$ but is directly trained once on the full sample $s$ . Sanguiao-Sande and Zhang (2021) refer to the operation $E_{q} (\cdot | s)$ as SRB, since it yields the conditional expectation over subsampling $s_{1} ~ q (s_{1} | s)$ , where the unordered $s$ is the minimal sufficient statistic with respect to the sampling distribution $p (s)$ .

It follows from the definition that the bias of ${\hat{Y}}^{RB}$ is given by

E_{p} ({\hat{Y}}^{RB}) - Y = E_{pq} ({\hat{Y}}_{1}^{*}) - Y = E_{pq} (B) .

Given $s_{1}$ , a conditionally unbiased predictor of $B$ is given by

\hat{B} = \sum_{j \in s_{2}} (π_{2 j}^{- 1} - 1) e_{1 j}

in the sense that, as $s$ varies conditional on $s_{1}$ , we have

E_{s} (\hat{B} | s_{1}) = B_{1} - E_{s} {B (s_{2}) | s_{1}} = E_{s} (B | s_{1})

since $e_{1 j}$ resulting from $μ (x_{j}, s_{1})$ trained on the subsample $s_{1}$ is fixed with respect to sampling of $s_{2}$ by $f (s | s_{1})$ conditional on $s_{1}$ . Applying Rao-Blackwellisation to $\hat{B}$ yields then a more efficient estimator

{\hat{B}}^{RB} = E_{q} (\hat{B} | s)

(6)

as an unbiased estimator of the bias of ${\hat{Y}}^{RB}$ with respect to $p (s)$ , since

E_{p} ({\hat{B}}^{RB}) = E_{p} {E_{q} (\hat{B} | s)} = E_{s_{1}} {E_{s} (\hat{B} | s_{1})} = E_{s_{1}} {E_{s} (B | s_{1})} = E_{pq} (B) .

Notice that one needs at least $| s_{2} | = 1$ to calculate $\hat{B}$ , in which case $μ (x, s_{1})$ is trained on $n - 1$ units, that is, the so-called leave-one-out (LOO) predictor, whose difference to $μ (x, s)$ is only due to one randomly selected unit.
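The conditional unbiasedness $E_{s} (\hat{B} | s_{1}) = E_{s} (B | s_{1})$ can be verified by enumeration in a simple case. The Python sketch below assumes SRSWOR for both the $p$ -design and the $q$ -design, so that given $s_{1}$ the test set $s_{2}$ is an SRSWOR sample of size $n_{2}$ from $U \ s_{1}$ with $π_{2 j} = n_{2} / (N - n_{1})$ , and it uses the subsample mean as a stand-in for $μ (x, s_{1})$ . The function name conditional_means and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def conditional_means(y, s1, n2):
    """Return (E[B_hat | s1], E[B | s1]) by enumerating every possible s2.

    Assumes s2 is an SRSWOR sample of size n2 from the units outside s1.
    """
    N = len(y)
    rest = [j for j in range(N) if j not in s1]    # U minus s1
    M = len(rest)                                  # N - n1
    mu = mean(y[i] for i in s1)                    # predictor trained on s1 only
    e = {j: mu - y[j] for j in rest}               # e_1j, fixed given s1
    B1 = sum(e.values())
    w = M / n2 - 1                                 # 1 / pi_2j - 1
    b_hat, b = [], []
    for s2 in combinations(rest, n2):
        b_s2 = sum(e[j] for j in s2)               # B(s2), observed on the test set
        b_hat.append(w * b_s2)                     # B_hat
        b.append(B1 - b_s2)                        # B, the unobserved total error
    return mean(b_hat), mean(b)


# Hypothetical toy population and training subsample
y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
e_b_hat, e_b = conditional_means(y, s1=(0, 2, 5), n2=2)
```

Both averages equal $(1 - n_{2} / (N - n_{1})) B_{1}$ here, since each unit outside $s_{1}$ enters $s_{2}$ with the same probability.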

Moreover, one can estimate unbiasedly the MSE of ${\hat{Y}}^{RB}$ by the following result, the proof of which is given in the Appendix.

Theorem 1. For any given $μ (\cdot)$ , an unbiased estimator of the MSE of the SRB prediction estimator (4), over $s ~ p (s)$ , is given by

{mse}^{RB} = E_{q} {{\hat{B}}^{2} - {\hat{V}}_{s} (\hat{B} | s_{1}) + {\hat{V}}_{s} {B (s_{2}) | s_{1}} | s} - V_{q} ({\hat{Y}}_{1}^{*} | s)

(7)

where $\hat{B} = \sum_{j \in s_{2}} (π_{2 j}^{- 1} - 1) {μ (x_{j}, s_{1}) - y_{j}}$ , and ${\hat{V}}_{s} (\hat{B} | s_{1})$ is unbiased for

V_{s} (\hat{B} | s_{1}) = \sum_{i \notin s_{1}} \sum_{j \notin s_{1}} (π_{2 ij} - π_{2 i} π_{2 j}) (\frac{1}{π_{2 i}} - 1) (\frac{1}{π_{2 j}} - 1) e_{1 i} e_{1 j}

where $π_{2 ij} = \Pr (i, j \in s_{2} | s_{1})$ , and ${\hat{V}}_{s} {B (s_{2}) | s_{1}}$ is unbiased for

V_{s} {B (s_{2}) | s_{1}} = \sum_{i \notin s_{1}} \sum_{j \notin s_{1}} (π_{2 ij} - π_{2 i} π_{2 j}) e_{1 i} e_{1 j} .

Notice that one needs at least $| s_{2} | = 2$ to calculate the variance estimators in Equation (7), in which case $μ (x, s_{1})$ is trained on $n - 2$ units, that is, the leave-two-out (LTO) predictor, which differs from $μ (x, s)$ only due to two randomly selected units.
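Theorem 1 can be checked exactly on a small population by enumerating every sample and every sample split. The Python sketch below is an illustration under stated assumptions rather than a general implementation: both designs are SRSWOR, the predictor is the subsample mean, and the variance estimators are the standard SRSWOR ones, namely $n_{2} (1 - n_{2} / M) s_{e}^{2}$ for $B (s_{2})$ with $M = N - n_{1}$ and $s_{e}^{2}$ the sample variance of $e_{1 j}$ over $s_{2}$ , and $(M / n_{2} - 1)^{2}$ times that for $\hat{B}$ . The function name exact_srb and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance, variance


def exact_srb(y, s, n2):
    """Exact SRB estimate (4) and mse^RB (7) for one sample s.

    Assumes SRSWOR p- and q-designs and the subsample-mean predictor.
    """
    N, n = len(y), len(s)
    n1 = n - n2
    M = N - n1                                     # size of U minus s1
    w = M / n2 - 1                                 # 1 / pi_2j - 1
    y_sum = sum(y[i] for i in s)
    y_hat, core = [], []
    for s1 in combinations(s, n1):                 # every split is equally likely
        s2 = [i for i in s if i not in s1]
        mu = mean(y[i] for i in s1)
        e = [mu - y[j] for j in s2]                # observed errors e_1j
        v_b2 = n2 * (1 - n2 / M) * variance(e)     # V-hat of B(s2) given s1
        b_hat = w * sum(e)
        v_bhat = w ** 2 * v_b2                     # V-hat of B_hat given s1
        core.append(b_hat ** 2 - v_bhat + v_b2)
        y_hat.append(y_sum + (N - n) * mu)         # subsample-trained estimator
    # E_q{...| s} minus V_q(Y_hat_1^* | s), with V_q taken over the splits
    return mean(y_hat), mean(core) - pvariance(y_hat)


# Hypothetical toy population; n2 = 2 gives the leave-two-out case
y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
N, n, n2 = len(y), 4, 2
Y = sum(y)

results = [exact_srb(y, s, n2) for s in combinations(range(N), n)]
true_mse = mean((y_rb - Y) ** 2 for y_rb, _ in results)   # MSE over p(s)
avg_mse_rb = mean(m for _, m in results)                  # E_p(mse^RB)
```

Averaged over all samples, the estimator reproduces the true design-based MSE, although its value for a particular sample can differ considerably from that average.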

2.2. Discussion

In a recent discussion of cross-validation for prediction error estimation under the IID model of $(y_{i}, x_{i})$ , Bates et al. (2024, Theorem 3) have shown that, given a sample $s$ of size $n$ , it is possible to estimate unbiasedly the MSE of $(K - 1)$ -fold cross-validation on a reduced sample of size $n (K - 1) / K$ , but not for the $K$ -fold cross-validation on the sample of size $n$ which one actually does. Unbiased MSE estimation for the latter is generally difficult if $μ (x, s)$ does not have a wieldy expression, because by definition one cannot observe the error of $μ (x, s)$ without extra out-of-sample observations.

We have obtained a design-unbiased MSE estimator, Equation (7), for the SRB prediction estimator ${\hat{Y}}^{RB}$ that uses $\bar{μ} (x, s)$ , but not for $\hat{Y}$ using the more familiar $μ (x, s)$ that is trained once on the full sample. Design-unbiased estimation of $MSE (\hat{Y})$ faces the same general difficulty as model-based MSE estimation. Indeed, similarly to reduced-sample cross-validation, we could easily estimate unbiasedly the design-based MSE of the hypothetical prediction estimator

{\hat{Y}}_{1} = \sum_{i \in s_{1}} y_{i} + \sum_{j \in U \ s_{1}} μ (x_{j}, s_{1})

as if $s_{1}$ was the sample (instead of the actual $s$ ) such that $μ (x, s_{1})$ was the full-sample once-trained predictor. Analogously to Theorem 1, we would have $E_{q} {{\hat{B}}_{1} | s}$ as an unbiased estimator of the design-based bias of ${\hat{Y}}_{1}$ , where

{\hat{B}}_{1} = \sum_{i \in s_{2}} π_{2 i}^{- 1} {μ (x_{i}, s_{1}) - y_{i}}

and $E_{q} {{\hat{B}}_{1}^{2} - {\hat{V}}_{s} ({\hat{B}}_{1}) | s}$ as an unbiased estimator of $MSE ({\hat{Y}}_{1})$ , where ${\hat{V}}_{s} ({\hat{B}}_{1})$ is unbiased for

V_{s} ({\hat{B}}_{1}) = \sum_{i \notin s_{1}} \sum_{j \notin s_{1}} (\frac{π_{2 ij}}{π_{2 i} π_{2 j}} - 1) {μ (x_{i}, s_{1}) - y_{i}} {μ (x_{j}, s_{1}) - y_{j}}

Next, as will be illustrated later, we note that $MSE ({\hat{Y}}^{RB})$ using $\bar{μ} (x, s)$ can provide a good approximation to $MSE (\hat{Y})$ using $μ (x, s)$ , where ${\hat{Y}}^{RB}$ and $\hat{Y}$ should be close to each other now that both of them are trained on the full sample, as long as $μ$ is “stable” in a suitable sense. Specifically, we have

\hat{Y} - Y \equiv E_{q} (\hat{Y} - Y | s) = E_{q} {\sum_{j \notin s} [μ (x_{j}, s) - μ (x_{j}, s_{1})] | s} + ({\hat{Y}}^{RB} - Y)

where the term $E_{q} {\cdot}$ is of a lower order than ${\hat{Y}}^{RB} - Y$ , if $μ$ is $n_{2}$ -times $q$ -stable. Sanguiao-Sande and Zhang (2021) define the LOO-predictor $μ$ to be $q$ -stable, if $μ (x, s) - μ (x, s_{1}) \to^{P} 0$ asymptotically, as $n, N \to \infty$ , while $n_{2} = 1$ is held fixed. Here, $μ$ is said to be $n_{2}$ -times $q$ -stable if the same holds given any $n_{2} \geq 1$ .

Finally, notice that, apart from familiarity and custom, there is no reason why one cannot adopt the SRB prediction estimator, Equation (4), for which MSE estimation is unbiased, instead of using the once-trained predictor $μ (x, s)$ .

2.3. Notes on Implementation

2.3.1. Sampling Probability

By definition, Equation (3), $π_{2 i}$ is the conditional $s_{2}$ -inclusion probability given $s_{1}$ , the calculation of which generally requires $f (s | s_{1})$ that is derived from $q (s_{1} | s) p (s)$ . However, $p (s)$ is unknown for many unequal-probability sampling methods in practice, such as the cube method (Deville and Tillé 2004), although the inclusion probability $π_{i} = \Pr (i \in s)$ is always known.

One can use instead another sampling probability of the $pq$ -design. For any $i \in U$ , let its conditional test-set inclusion probability given $i \notin s_{1}$ be

ϕ_{2 i} = \Pr (i \in s_{2} | i \notin s_{1}) = \frac{\Pr (i \in s_{2}, i \notin s_{1})}{\Pr (i \notin s_{1})} = \frac{π_{i} {1 - \Pr (i \in s_{1} | i \in s)}}{1 - π_{i} \Pr (i \in s_{1} | i \in s)} .

(8)

Given $π_{i}$ , the probability $ϕ_{2 i}$ can be calculated as long as $\Pr (i \in s_{1} | i \in s)$ does not depend on $i$ under the subsampling design and can be specified regardless of the realized $s$ , such as SRS of $s_{1}$ from $s$ with or without replacement, or $T$ -fold cross-validation. Since

ϕ_{2 i} = \frac{\sum_{s_{1} : i \notin s_{1}} \sum_{s : i \in s \cap i \notin s_{1}} f (s | s_{1}) f (s_{1})}{\sum_{s_{1} : i \notin s_{1}} f (s_{1})} = \frac{\sum_{s_{1} : i \notin s_{1}} π_{2 i} f (s_{1})}{\sum_{s_{1} : i \notin s_{1}} f (s_{1})} = E_{s_{1}} {π_{2 i} | i \notin s_{1}},

it is the conditional expectation of non-zero $π_{2 i}$ , where $π_{2 i} = 0$ iff $i \in s_{1}$ .

To illustrate, take the special case of SRSWOR of $s$ from $U$ and SRSWOR of $s_{1}$ from $s$ , with sample sizes $n = | s |$ , $n_{1} = | s_{1} |$ , and $n_{2} = | s_{2} | = n - n_{1}$ . For any given $s_{1}$ and $i \notin s_{1}$ , we have exactly

π_{2 i} = \frac{\Pr (i \in s_{2}) f (s_{1} | i \in s_{2})}{f (s_{1})} = \frac{\frac{n_{2}}{N} {(\begin{matrix} N - 1 \\ n_{1} \end{matrix})}^{- 1}}{{(\begin{matrix} N \\ n_{1} \end{matrix})}^{- 1}} = \frac{n_{2}}{N - n_{1}} = ϕ_{2 i} .
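Equation (8) is straightforward to evaluate once $π_{i}$ and $\Pr (i \in s_{1} | i \in s)$ are given. The following Python sketch uses exact rational arithmetic to confirm that, in the SRSWOR special case with $π_{i} = n / N$ and $\Pr (i \in s_{1} | i \in s) = n_{1} / n$ , it reduces to $n_{2} / (N - n_{1})$ . The function name phi2 and the sizes are hypothetical.

```python
from fractions import Fraction


def phi2(pi_i, r):
    """Equation (8): Pr(i in s2 | i not in s1).

    pi_i : first-order inclusion probability Pr(i in s)
    r    : subsampling probability Pr(i in s1 | i in s)
    """
    return pi_i * (1 - r) / (1 - pi_i * r)


# Hypothetical sizes for the SRSWOR special case
N, n, n1 = 20, 8, 5
n2 = n - n1

phi_srswor = phi2(Fraction(n, N), Fraction(n1, n))
expected = Fraction(n2, N - n1)                    # n2 / (N - n1)

# Unequal-probability units only need their own pi_i, not the full p(s)
phi_unequal = [phi2(Fraction(p, 100), Fraction(n1, n)) for p in (10, 40, 75)]
```

The last line illustrates the practical point of this subsection: $ϕ_{2 i}$ depends on the $p$ -design only through $π_{i}$ .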

Similarly, instead of $π_{2 ij}$ in Equation (7), one can use another conditional joint $s_{2}$ -inclusion probability

ϕ_{2 ij} = \Pr (i, j \in s_{2} | i, j \notin s_{1}) = \frac{\Pr (i, j \in s_{2} | i, j \in s) π_{ij}}{\Pr (i, j \notin s_{1})}

where $\Pr (i, j \in s_{2} | i, j \in s)$ is known for random subsampling of $s_{1}$ or $T$ -fold cross-validation, and $\Pr (i, j \notin s_{1}) = 1 - \Pr (i \in s_{1}) - \Pr (j \in s_{1}) + \Pr (i, j \in s_{1})$ .

2.3.2. Monte Carlo SRB

By Equation (7), the exact ${mse}^{RB}$ for the LTO-SRB prediction estimator ${\hat{Y}}^{RB}$ requires $C (n, 2) = (\begin{matrix} n \\ 2 \end{matrix})$ sample splits. If feasible, it would also provide a good estimator of $MSE (\hat{Y})$ given any two-times $q$ -stable $μ$ as explained earlier. The variance of ${mse}^{RB}$ should be comparable to that of the HT variance estimator; indeed, the former tends to be smaller than the latter if the model-assisted ${\hat{Y}}^{RB}$ is more efficient than the HT-estimator ‘assisted’ only by $x_{i} = π_{i} N / n$ (Section 1).

The exact SRB operation may be infeasible if $C (n, 2)$ is too large, in which case it needs to be replaced by Monte Carlo Rao-Blackwellisation (MC-RB) using $T$ sample splits, $T \ll C (n, 2)$ . The corresponding MC-SRB prediction estimator, denoted by ${\tilde{Y}}^{RB}$ , is given by replacing $\bar{μ} (x, s)$ in Equation (4) by

\tilde{μ} (x, s) = \frac{1}{T} \sum_{t = 1}^{T} μ (x, s_{1}^{(t)})

based on sample splits $(s_{1}^{(t)}, s_{2}^{(t)})$ for $t = 1, . . ., T$ . The MSE of ${\tilde{Y}}^{RB}$ is then

MSE ({\tilde{Y}}^{RB}) = MSE ({\hat{Y}}^{RB}) + E_{p} \{ V_{q} ({\hat{Y}}_{1}^{*} | s) \} / T .
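The extra term can be traced in two lines. The sketch below rests on assumptions that are not restated in this section: that the $T$ splits are drawn independently from $q (s_{1} | s)$, that ${\tilde{Y}}^{RB}$ is the average of the $T$ single-split estimators ${\hat{Y}}_{1}^{* (t)}$, that ${\hat{Y}}^{RB} = E_{q} ({\hat{Y}}_{1}^{*} | s)$, and that $Y$ denotes the population total.

```latex
\begin{aligned}
E_{q} \{ ({\tilde{Y}}^{RB} - Y)^{2} \mid s \}
  &= ({\hat{Y}}^{RB} - Y)^{2} + V_{q} ({\tilde{Y}}^{RB} \mid s)
   = ({\hat{Y}}^{RB} - Y)^{2} + \frac{1}{T} V_{q} ({\hat{Y}}_{1}^{*} \mid s), \\
MSE ({\tilde{Y}}^{RB})
  &= E_{p} \big[ E_{q} \{ ({\tilde{Y}}^{RB} - Y)^{2} \mid s \} \big]
   = MSE ({\hat{Y}}^{RB}) + \frac{1}{T} E_{p} \{ V_{q} ({\hat{Y}}_{1}^{*} \mid s) \} .
\end{aligned}
```

The cross term vanishes in the first line because, given $s$, the MC average is centred at ${\hat{Y}}^{RB}$ under these assumptions.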

The unbiased exact-RB estimator of $MSE ({\tilde{Y}}^{RB})$ follows from Equation (7), that is,

{mse}^{RB} ({\tilde{Y}}^{RB}) = E_{q} \{ {\hat{B}}^{2} - {\hat{V}}_{s} (\hat{B} | s_{1}) + {\hat{V}}_{s} \{ B (s_{2}) | s_{1} \} \mid s \} - \frac{T - 1}{T} V_{q} ({\hat{Y}}_{1}^{*} | s),

such that the unbiased MC-RB estimator of $MSE ({\tilde{Y}}^{RB})$ is given by

{\tilde{mse}}^{RB} = \frac{1}{T} \sum_{t = 1}^{T} ( {\hat{B}}^{2 (t)} - {\hat{V}}_{s} (\hat{B} | s_{1}^{(t)}) + {\hat{V}}_{s} \{ B (s_{2}) | s_{1}^{(t)} \} - \{ {\hat{Y}}_{1}^{* (t)} - {\tilde{Y}}^{RB} \}^{2} ),

(9)

where the index $t$ explicates the computation results for $t = 1, . . ., T$ .

For the LTO ${\tilde{Y}}^{RB}$, the ratio $V_{pq} ({\tilde{mse}}^{RB}) / V_{p} ({mse}^{RB})$ converges to 1 much more slowly than $MSE ({\tilde{Y}}^{RB}) / MSE ({\hat{Y}}^{RB})$ as $T$ increases. The inflation of $V_{pq} ({\tilde{mse}}^{RB})$ is almost entirely due to approximating the $E_{q} \{\cdot\}$-term in Equation (7) by MC in Equation (9), wherein the terms are all estimated from $s_{2}^{(t)}$ of size $n_{2} = 2$. While this MC inflation of $V_{pq} ({\tilde{mse}}^{RB})$ can be reduced by increasing either $T$ or $n_{2}$, the reduction is considerably faster as $n_{2}$ increases. Note that the condition of $n_{2}$-times $q$-stability of $μ$ comes under greater pressure as $n_{2}$ increases, which might affect whether $MSE ({\tilde{Y}}^{RB})$ would remain close to $MSE (\hat{Y})$, where $\hat{Y}$ is based on $μ (x, s)$.

Therefore, in practice, one can first explore the MC variance of ${\tilde{mse}}^{RB}$ in relation to $n_{2}$ using a moderately large $T$ , in order to choose a value of $n_{2}$ that reduces the MC variance as much as possible without jeopardizing the $n_{2}$ -times stability condition, and then compute ${\tilde{mse}}^{RB}$ with the chosen $n_{2}$ using $T$ that is as large as computationally practical.

2.4. Illustration

Let us illustrate with a simple example. Generate and fix a population of size $N = 1000$ by $y_{i} = β_{1} x_{1 i} + β_{2} x_{2 i} + ϵ_{i}$ with IID $x_{1 i} \sim LogN (1, 1)$, $x_{2 i} \sim Poisson (5)$, and $ϵ_{i} \sim N (0, σ^{2} / 4)$, where $σ^{2}$ is the population variance of $x_{1 i}$. Let $s$ be given by SRSWOR with $n = 100$. Let SRSWOR be the subsampling $q$-design of $s_{1}$ with size $n_{1}$ that is to be specified. Let the mis-specified predictor be $μ (x, s) = x^{T} β$, with $x = (1, x_{1})$ but omitting $x_{2}$. The full-sample once-trained $μ (x, s)$ and the SRB $\bar{μ} (x, s)$ are approximately but not exactly equal to each other, where

μ (x, s) = x^{T} (\sum_{i \in s} x_{i} x_{i}^{T})^{- 1} (\sum_{i \in s} x_{i} y_{i})

\bar{μ} (x, s) = E_{q} \{ μ (x, s_{1}) | s \} = x^{T} E_{q} ( {( \sum_{i \in s} I (i \in s_{1}) x_{i} x_{i}^{T} )}^{- 1} ( \sum_{i \in s} I (i \in s_{1}) x_{i} y_{i} ) \mid s )
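The two predictors can be compared numerically. The sketch below follows the setup described above but relies on several assumptions: the values of $β_{1}$ and $β_{2}$ (not stated in the text), the reading of $LogN (1, 1)$ as log-mean 1 and log-standard-deviation 1, the choice $n_{1} = 80$, and a Monte Carlo average over $T = 500$ subsamples in place of the exact $E_{q}$.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed

N, n, n1, T = 1000, 100, 80, 500
beta1, beta2 = 1.0, 1.0                        # hypothetical coefficients
x1 = rng.lognormal(mean=1.0, sigma=1.0, size=N)
x2 = rng.poisson(5, size=N)
sigma2 = x1.var()                              # population variance of x1
y = beta1 * x1 + beta2 * x2 + rng.normal(0.0, np.sqrt(sigma2 / 4), size=N)
X = np.column_stack([np.ones(N), x1])          # mis-specified design: x2 is omitted

def ols_fit(idx):
    """Least-squares coefficients from the units in idx."""
    coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return coef

s = rng.choice(N, size=n, replace=False)       # SRSWOR p-design
beta_full = ols_fit(s)                         # coefficients behind mu(x, s)

# MC approximation to the SRB predictor: average the subsample fits over T splits.
beta_mc = np.mean(
    [ols_fit(rng.choice(s, size=n1, replace=False)) for _ in range(T)], axis=0
)

mu_full = X @ beta_full
mu_srb = X @ beta_mc    # predictions are linear in the coefficients, so averaging commutes
```

Comparing `mu_full` and `mu_srb` over the out-of-sample units gives a sense of how close the once-trained and SRB predictors are for one realized sample.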

Given a single realized sample $s$, as in practice, Figure 1 illustrates the MC variance (left) and expectation (right) as $n_{2}$ increases from 2 toward $n$ under this setup, given $T = 10^{3}$. The MC variance is seen to decrease sharply to a plateau as $n_{2}$ increases from 2, before it increases dramatically again as $n_{2}$ gets close to $n$. Meanwhile, the MC expectation is quite stable, say, for $n_{2} \leq 40$. For example, setting $n_{2}$ to be 20 or 30 for the final computation of ${\tilde{mse}}^{RB}$ using $T = 10^{5}$, we would obtain $0.020$ or $0.014$ as the MC coefficient of variation (CV), which seems acceptable for practical purposes.

Figure 1.

MC variance (left) and expectation (right) of ${\tilde{mse}}^{RB}$ given $n_{2} \geq 2$ .

Table 1 shows the results of simulating MSE estimation based on 250 independent samples, given $T = 10^{3}$ and $n_{2} = 2, 20, 30$ . The MSE is simply the average squared error of either $\hat{Y}$ or ${\hat{Y}}^{RB}$ over the 250 samples, and the relative efficiency (RE) is the ratio between either MSE and the variance of the HT-estimator. Notice that the three $MSE (\hat{Y})$ here are all estimators of the same MSE, each using 250 independent samples, since $\hat{Y}$ depends only on $s$ .

Table 1.

MSE Estimation from 250 Samples, $T = 10^{3}$; $μ (x, s)$ for $\hat{Y}$ and $\bar{μ} (x, s)$ for ${\hat{Y}}^{RB}$; (Training, Test) Set of Size $(n_{1}, n_{2})$; RE Against Variance of HT-Estimator.

$(n_{1}, n_{2})$	$MSE (\hat{Y})$	$RE (\hat{Y})$	$MSE ({\hat{Y}}^{RB})$	$RE ({\hat{Y}}^{RB})$	$CV ({\tilde{mse}}^{RB})$
(98, 2)	386,532.7	0.44	386,632.4	0.44	3.48
(80, 20)	363,613.9	0.41	363,441.5	0.41	0.31
(70, 30)	362,673.0	0.41	357,146.9	0.41	0.21

For $n_{2}$ up to 20 (or even 30), $MSE ({\hat{Y}}^{RB})$ is practically equal to $MSE (\hat{Y})$ . The CV of the MC-MSE estimator ${\tilde{mse}}^{RB}$ is drastically reduced by setting $n_{2}$ to 20 or 30 instead of 2. In comparison, the CV of the exact-RB MSE estimator ${mse}^{RB}$ is 0.14 by simulation, whereas the CV of the HT variance estimator is 0.32. This confirms that setting $n_{2}$ to be 20 (or even 30) and using a larger but practical $T$ would work satisfactorily for MSE estimation in this setup.

In terms of the choice of estimator, we notice that the mis-specified predictor $μ (x, s)$ yields a design-based MSE that is less than half of the variance of the HT-estimator, and the bias of $\hat{Y}$ or ${\hat{Y}}^{RB}$ is a negligible part of the MSE here, the details of which are omitted to save space. The assessment is enabled by the design-based predictive inference theory developed above. Finally, as mentioned before, there is no reason why one cannot adopt ${\hat{Y}}^{RB}$ using $\bar{μ} (x, s)$ , for which MSE estimation is unbiased, instead of $\hat{Y}$ using $μ (x, s)$ .

3. Individual Prediction Estimator

Consider the individual-level predictor $μ (x, s)$ in the prediction estimator (1). Regardless of how $μ (x, s)$ is obtained from $\{ (y_{i}, x_{i}) : i \in s \}$, using whichever model or algorithm, its total squared error (TSE) over $R = U \setminus s$ is given by

D (s; μ) = \sum_{i \in R} \{ μ (x_{i}, s) - y_{i} \}^{2} .

For design-based individual-level predictive inference, we define the risk of $μ$ to be the expectation of $D (s; μ)$ over repeated sampling of $s \sim p (s)$, while treating $y_{U}$ and $x_{U}$ as fixed, denoted by

τ (μ) = E_{p} \{ D (s; μ) \} .

(10)

We stress that only $s$ is random in Equation (10), that is, it is a design-based measure, regardless of the model or algorithm by which $μ (x, s)$ is constructed.
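In a simulation where the whole population is available, the TSE and the risk in Equation (10) can be approximated by repeated sampling. The sketch below assumes an SRSWOR $p$-design and an ordinary least-squares predictor; the function names (`tse`, `mc_risk`, `ols_fit_predict`) and the synthetic population are illustrative only.

```python
import numpy as np

def tse(y, pred, sample_idx):
    """D(s; mu): total squared error over the out-of-sample units R."""
    out = np.ones(len(y), dtype=bool)
    out[sample_idx] = False
    return float(np.sum((pred[out] - y[out]) ** 2))

def mc_risk(y, X, n, fit_predict, reps, rng):
    """Approximate tau(mu) = E_p{D(s; mu)} by averaging over repeated SRSWOR samples."""
    N = len(y)
    total = 0.0
    for _ in range(reps):
        s = rng.choice(N, size=n, replace=False)
        total += tse(y, fit_predict(X, y, s), s)
    return total / reps

def ols_fit_predict(X, y, s):
    """Train on the sample s only, then predict for every unit in U."""
    coef, *_ = np.linalg.lstsq(X[s], y[s], rcond=None)
    return X @ coef

rng = np.random.default_rng(1)
N = 500
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + rng.normal(size=N)   # generated once, then held fixed; only s varies
tau_hat = mc_risk(y, X, n=50, fit_predict=ols_fit_predict, reps=200, rng=rng)
```

Note that `y` is generated once and then treated as a constant: all randomness in `tau_hat` comes from the repeated draws of `s`, in line with the design-based definition.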

3.1. Risk of SRB Predictor

Under the same $pq$ -design of $(s_{1}, s)$ as in Section 2, the error of the subsample-trained predictor $μ (x, s_{1})$ for any $i \notin s_{1}$ is given by

e_{i} (μ, s_{1}) = μ (x_{i}, s_{1}) - y_{i}

which can be observed for any unit in the test set $s_{2} = s \setminus s_{1}$. Let

D_{R} (s_{1}; μ) = \sum_{i \in R} e_{i} (μ, s_{1})^{2}

be the TSE of $μ (x, s_{1})$ over $R = U \setminus s$. Let

A_{2} = \sum_{i \in s_{2}} e_{i} (μ, s_{1})^{2} = \sum_{i \in U \setminus s_{1}} e_{i} (μ, s_{1})^{2} - D_{R} (s_{1}; μ) .

Given $s_{1}$, both $A_{2}$ and $D_{R} (s_{1}; μ)$ vary with $s_{2} = s \setminus s_{1}$ under the $pq$-design, but their sum $\sum_{i \in U \setminus s_{1}} e_{i} (μ, s_{1})^{2}$ is fixed. The predictor

{\hat{D}}_{R} (s_{1}; μ) = \sum_{i \in s_{2}} (π_{2 i}^{- 1} - 1) e_{i} (μ, s_{1})^{2}

is unbiased for $D_{R} (s_{1}; μ)$ conditional on $s_{1}$ , since

E_{s} \{ {\hat{D}}_{R} (s_{1}; μ) | s_{1} \} = E_{s} \{ \sum_{i \in s_{2}} π_{2 i}^{- 1} e_{i} (μ, s_{1})^{2} - A_{2} | s_{1} \} = E_{s} \{ D_{R} (s_{1}; μ) | s_{1} \} .

(11)
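Equation (11) can be verified by brute force in a toy setting. The sketch below assumes the SRSWOR special case, where, given $s_{1}$, the test set $s_{2}$ is an SRSWOR of size $n_{2}$ from $U \setminus s_{1}$ with $π_{2 i} = n_{2} / (N - n_{1})$. The squared errors are arbitrary fixed numbers and the function name `check_unbiased` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def check_unbiased(sq_errors, n2):
    """Enumerate every possible s2 and return (E{D_hat_R | s1}, E{D_R | s1}).

    sq_errors holds e_i(mu, s1)^2 for the units outside s1, fixed given s1.
    """
    m = len(sq_errors)                  # m = N - n1
    pi2 = Fraction(n2, m)               # conditional s2-inclusion probability
    est_sum, true_sum, count = Fraction(0), Fraction(0), 0
    for s2 in combinations(range(m), n2):
        in_s2 = set(s2)
        # Predictor of D_R: weighted squared errors observed in s2.
        est_sum += sum((1 / pi2 - 1) * sq_errors[i] for i in s2)
        # Actual D_R: squared errors of the units in neither s1 nor s2.
        true_sum += sum(sq_errors[i] for i in range(m) if i not in in_s2)
        count += 1
    return est_sum / count, true_sum / count

errors_sq = [Fraction(k * k) for k in (1, 3, 2, 5, 4, 7)]
expected_est, expected_true = check_unbiased(errors_sq, n2=2)
assert expected_est == expected_true
```

All subsets are equally likely under SRSWOR, so the plain averages over the enumeration are the exact conditional expectations.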

Meanwhile, the TSE of the SRB-predictor $\bar{μ} (x, s)$ given by Equation (5) is

D (s; \bar{μ}) = \sum_{i \in R} e_{i} (\bar{μ})^{2} \quad \text{and} \quad e_{i} (\bar{μ}) = \bar{μ} (x_{i}, s) - y_{i} .

For any $i \in R$ with $x_{i} = x$ , the errors of $\bar{μ} (x, s)$ and $μ (x, s_{1})$ are related by

e_{i} (μ, s_{1}) = μ (x, s_{1}) - y_{i} = \{ μ (x, s_{1}) - \bar{μ} (x, s) \} + e_{i} (\bar{μ}) .

Since $\bar{μ} (x, s)$ and $e_{i} (\bar{μ})$ are constant with respect to $s_{1} \sim q (s_{1} | s)$, we have

e_{i} (\bar{μ})^{2} = E_{q} \{ e_{i} (μ, s_{1})^{2} | s \} - E_{q} \{ a_{i} (μ, s_{1})^{2} | s \}

(12)

where $a_{i} (μ, s_{1}) = μ (x, s_{1}) - \bar{μ} (x, s)$ and $E_{q} \{ a_{i} (μ, s_{1})^{2} | s \}$ is the variance of $μ (x, s_{1})$ under the SRB operation. Design-unbiased estimation of the risk $D (s; \bar{μ})$ is given by the result below, the proof of which is given in the Appendix.

Theorem 2. For any given $μ (\cdot)$, an unbiased estimator of the risk $τ (\bar{μ})$ of the corresponding SRB-predictor $\bar{μ} (x, s)$, over $s \sim p (s)$, is given by

\hat{D} (s; \bar{μ}) = E_{q} ( \sum_{i \in s_{2}} (π_{2 i}^{- 1} - 1) \{ e_{i} (μ, s_{1})^{2} - a_{i} (μ, s_{1})^{2} \} \mid s ) .

In practice, where exact SRB is infeasible numerically, one can use the MC-SRB predictor based on $T$ subsamples, which is given as

\begin{cases} \tilde{μ} (x_{i}, s) = T^{- 1} \sum_{t = 1}^{T} μ (x_{i}, s_{1}^{(t)}) & \text{if } i \in R \\ \overset{\cdot}{μ} (x_{i}, s) = T_{i}^{- 1} \sum_{t = 1}^{T} I (i \notin s_{1}^{(t)}) μ (x_{i}, s_{1}^{(t)}) & \text{if } i \in s \end{cases}

where $s_{1}^{(t)}$ is the $t$-th subsample, $T_{i} = \sum_{t = 1}^{T} I (i \notin s_{1}^{(t)})$ and $s_{2}^{(t)} = s \setminus s_{1}^{(t)}$.

To estimate the risk, for any $i \in s_{2}^{(t)}$ , let $e_{i} (μ, s_{1}^{(t)}) = μ (x_{i}, s_{1}^{(t)}) - y_{i}$ directly, and let $a_{i} (μ, s_{1}^{(t)}) = μ (x_{i}, s_{1}^{(t)}) - \overset{\cdot}{μ} (x_{i}, s)$ be an out-of-bag approximation to $μ (x_{i}, s_{1}^{(t)}) - \bar{μ} (x_{i}, s)$ , instead of $μ (x_{i}, s_{1}^{(t)}) - \tilde{μ} (x_{i}, s)$ that would have been a residual-based alternative. The MC risk estimator is given by

\tilde{D} (s; \bar{μ}) = \frac{1}{T} \sum_{t = 1}^{T} \sum_{i \in s_{2}^{(t)}} (π_{2 i}^{- 1} - 1) \{ e_{i} (μ, s_{1}^{(t)})^{2} - a_{i} (μ, s_{1}^{(t)})^{2} \} .

(13)
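A minimal sketch of Equation (13) is given below for the SRSWOR case, where $π_{2 i} = n_{2} / (N - n_{1})$. It is an illustration under stated assumptions rather than a reference implementation: the `fit` interface (a callable returning a prediction function), the OLS learner, and the synthetic sample are all hypothetical choices.

```python
import numpy as np

def mc_risk_estimate(X, y, N, n1, T, fit, rng):
    """MC risk estimator in the form of Equation (13), SRSWOR p- and q-design.

    X, y : features and outcomes of the sample s only (n rows).
    N    : population size; n1: training-set size; T: number of splits.
    fit  : callable (X1, y1) -> predict function (assumed interface).
    """
    n = len(y)
    n2 = n - n1
    pi2 = n2 / (N - n1)                         # pi_2i under SRSWOR
    preds = np.full((T, n), np.nan)             # mu(x_i, s1^(t)), filled for i in s2^(t)
    for t in range(T):
        s1 = rng.choice(n, size=n1, replace=False)
        s2 = np.setdiff1d(np.arange(n), s1)
        preds[t, s2] = fit(X[s1], y[s1])(X[s2])
    in_s2 = ~np.isnan(preds)
    T_i = np.maximum(in_s2.sum(axis=0), 1)      # number of splits with i in s2^(t)
    oob_mean = np.nansum(preds, axis=0) / T_i   # out-of-bag average for each i in s
    e2 = (preds - y) ** 2                       # NaN wherever i was in s1^(t)
    a2 = (preds - oob_mean) ** 2
    return float(np.nansum((1.0 / pi2 - 1.0) * (e2 - a2)) / T)

def ols(X1, y1):
    coef, *_ = np.linalg.lstsq(X1, y1, rcond=None)
    return lambda Xn: Xn @ coef

rng = np.random.default_rng(3)
N, n = 2000, 200                                # synthetic data, illustrative sizes
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)
risk_hat = mc_risk_estimate(X, y, N=N, n1=140, T=50, fit=ols, rng=rng)
per_unit = risk_hat / (N - n)                   # standardized risk, tau / |R|
```

A simple sanity check: a predictor that always returns zero, applied to outcomes that are all one, has true TSE $N - n$, and the function returns exactly that value because $a_{i} = 0$ in every split.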

3.2. Using an Ensemble of Predictors

By design-based predictive inference, there is no need to assume that a true model exists for $y_{U}$ , or that one is able to identify the true model under repeated sampling. It is then natural to combine an ensemble of different predictors (e.g., Dietterich 2000; Zhou 2012; Dong et al. 2020; Sagi and Rokach 2018) in addition to selecting a single model and the corresponding predictor. Ensemble SRB prediction by voting or averaging is developed below.

3.2.1. SRB-Selector

Consider selecting a single model by voting given an order-$K$ heterogeneous ensemble $\{ μ_{1}, \dots, μ_{K} \}$. Let $D (s; μ_{k}) = \sum_{i \in R} \{ μ_{k} (x_{i}, s) - y_{i} \}^{2}$. Denote by $Ω = \bigcup_{k = 1}^{K} Ω_{k}$ the partition of the sample space such that, for any $s \in Ω_{k}$ and $l \neq k$, we have

D (s; μ_{k}) < D (s; μ_{l})

where we discount the possibility of $D (s; μ_{k}) = D (s; μ_{l})$ merely to simplify the exposition. To select a single predictor for $R$ based on a given sample $s$ , which minimises the risk Equation (10), one would vote for $μ_{k} (x, s)$ iff $s \in Ω_{k}$ . The optimal selector is thus the perfect classifier of $I (s \in Ω_{k})$ .

In practice it is a common approach to apply cross-validation and majority-vote, where cross-validation is based on $s_{1} \sim q (s_{1} | s)$ and $s_{2} = s \setminus s_{1}$. The expected selection result is given by the SRB-selector as follows. Given any $(s_{1}, s_{2})$ by $q (s_{1} | s)$ and any $k = 1, \dots, K$, let

δ_{k} (s_{1}) = \begin{cases} 1 & \text{if } k = \arg\min_{l = 1, \dots, K} \sum_{i \in s_{2}} \{ μ_{l} (x_{i}, s_{1}) - y_{i} \}^{2} \\ 0 & \text{otherwise} \end{cases}

indicate which predictor has the least TSE in $s_{2}$ . The SRB-selector

{\bar{δ}}_{k} (s) = \begin{cases} 1 & \text{if } k = \arg\max_{l = 1, \dots, K} E_{q} \{ δ_{l} (s_{1}) | s \} \\ 0 & \text{otherwise} \end{cases}

(14)

is a classifier of $I (s \in Ω_{k})$ , that is, the expected majority-vote over cross-validation. Given the selection by Equation (14), say, $μ_{k}$ , one can reuse the same cross-validation samples $(s_{1}, s_{2})$ to obtain the selected SRB-predictor ${\bar{μ}}_{k}$ and its associated risk.
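An MC version of the SRB-selector replaces $E_{q}$ by the vote share over $T$ random splits. The sketch below assumes SRSWOR subsampling and uses two toy learners (a mean-only predictor and OLS) as a hypothetical ensemble; the `fits` interface is an assumption made for illustration.

```python
import numpy as np

def mean_only(X1, y1):
    m = y1.mean()
    return lambda Xn: np.full(len(Xn), m)

def ols(X1, y1):
    coef, *_ = np.linalg.lstsq(X1, y1, rcond=None)
    return lambda Xn: Xn @ coef

def mc_srb_selector(X, y, n1, T, fits, rng):
    """Return (index of the selected predictor, vote shares) over T random splits."""
    n = len(y)
    votes = np.zeros(len(fits))
    for _ in range(T):
        s1 = rng.choice(n, size=n1, replace=False)
        s2 = np.setdiff1d(np.arange(n), s1)
        # TSE in the test set s2 for every member of the ensemble.
        tse = [np.sum((fit(X[s1], y[s1])(X[s2]) - y[s2]) ** 2) for fit in fits]
        votes[int(np.argmin(tse))] += 1      # delta_k(s1) = 1 for the least TSE
    shares = votes / T                       # MC approximation to E_q{delta_k(s1) | s}
    return int(np.argmax(shares)), shares

rng = np.random.default_rng(4)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = 1.0 + 2.0 * x + rng.normal(size=100)
selected, shares = mc_srb_selector(X, y, n1=70, T=50, fits=[mean_only, ols], rng=rng)
```

The same splits can be stored and reused to compute the selected predictor and its risk estimate, as the text notes.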

3.2.2. Mixed SRB-Predictor

Consider averaging given an order-$K$ ensemble $\{ μ_{1}, \dots, μ_{K} \}$, and let the mixed SRB-predictor be

μ (x, s) = \sum_{k = 1}^{K} w_{k} {\bar{μ}}_{k} (x, s)

(15)

where $\sum_{k = 1}^{K} w_{k} = 1$ and $w_{k} > 0$ for the mixing weights, $k = 1, \dots, K$ . We have

D (s; μ) = \sum_{k \neq 1} w_{k}^{2} D_{kk} + {(1 - \sum_{l \neq 1} w_{l})}^{2} D_{11} + \sum_{k \neq 1} \sum_{l \neq k, 1} w_{k} w_{l} D_{kl} + 2 \sum_{k \neq 1} w_{k} (1 - \sum_{l \neq 1} w_{l}) D_{1 k}

now that $w_{1} = 1 - \sum_{k \neq 1} w_{k}$ , where $D_{kk} = D (s; {\bar{μ}}_{k})$ and $D_{kl} = D (s; {\bar{μ}}_{k}, {\bar{μ}}_{l})$ is given by

D_{kl} = \sum_{i \in R} e_{i} ({\bar{μ}}_{k}) e_{i} ({\bar{μ}}_{l}) = \sum_{i \in R} [ E_{q} \{ e_{i} (μ_{k}, s_{1}) e_{i} (μ_{l}, s_{1}) | s \} - E_{q} \{ a_{i} (μ_{k}, s_{1}) a_{i} (μ_{l}, s_{1}) | s \} ],

that is, similarly to Equation (12). An estimator of $D_{kl}$ follows as a corollary of Theorem 2, as well as its MC implementation similarly to Equation (13).

The optimal mixing weights $w_{k}$ minimise $D (s; μ)$ . The estimated ${\hat{w}}_{k}$ can be obtained via $\hat{D} (s; μ)$ given ${\hat{D}}_{kl}$ , for all $k, l = 1, . . ., K$ . Substituting ${\hat{w}}_{k}$ in Equation (15) yields the mixed SRB-predictor. The associated risk Equation (10) can be estimated by $\hat{D} (s; μ)$ .
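Since $D (s; μ) = w^{T} D w$ with $D = (D_{kl})$, minimising it subject to $\sum_{k} w_{k} = 1$ has a closed form when $D$ is positive definite. The sketch below uses that closed form and is only an illustration: it does not enforce $w_{k} > 0$, and it assumes the supplied matrix (in practice an estimate ${\hat{D}}_{kl}$) is invertible. The function name `optimal_weights` is hypothetical.

```python
import numpy as np

def optimal_weights(D):
    """Minimise w' D w subject to sum(w) = 1, ignoring the positivity constraint.

    The Lagrangian solution is w = D^{-1} 1 / (1' D^{-1} 1).
    """
    D = np.asarray(D, dtype=float)
    z = np.linalg.solve(D, np.ones(D.shape[0]))
    return z / z.sum()

D_example = np.array([[2.0, 1.0],
                      [1.0, 3.0]])           # made-up values for illustration
w = optimal_weights(D_example)
risk_mixed = float(w @ D_example @ w)        # never larger than min(D_11, D_22)
```

If any resulting weight is negative, a constrained solver would be needed to respect $w_{k} > 0$. A nearly singular matrix is one way the instability under a homogeneous ensemble can show up numerically.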

Whilst the above approach aims at minimising the risk, it may experience instability if the ensemble is not sufficiently heterogeneous. A robust approach to mixed ensemble prediction should automatically aim at the same mixing weight for two component predictors that are equal to each other.

For any $k = 1, . . ., K$ , write $\hat{D} (s; {\bar{μ}}_{k}) = E_{q} ({\hat{τ}}_{k} | s)$ , similarly to $\hat{D} (s; \bar{μ})$ in Theorem 2. Regarding the risk of ${\bar{μ}}_{k} (x, s)$ defined by Equation (10), we have

τ ({\bar{μ}}_{k}) = E_{s_{1}} \{ τ ({\bar{μ}}_{k} | s_{1}) \} = E_{s_{1}} \{ E_{s} ({\hat{τ}}_{k} | s_{1}) \} = E_{p} \{ E_{q} ({\hat{τ}}_{k} | s) \}

where $τ ({\bar{μ}}_{k} | s_{1})$ is its conditional risk given $s_{1}$ . Let the SRB operation yield

w_{k} = E_{q} (δ_{k} | s), \quad δ_{k} = \begin{cases} 1 & \text{if } k = \arg\min_{l = 1, \dots, K} {\hat{τ}}_{l} \\ 0 & \text{otherwise} \end{cases} .

(16)

The corresponding mixed SRB-predictor, Equation (15), is robust against $μ_{k} \approx μ_{l}$ for any $k \neq l$ . While the SRB-selector, Equation (14), is a binary classifier taking the majority-vote over all $(s_{1}, s_{2})$ , the robust mixing weight, Equation (16), is a proportion over all $(s_{1}, s_{2})$ .
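With $T$ MC splits, the robust weights in Equation (16) become the proportion of splits in which each predictor attains the smallest ${\hat{τ}}_{l}$. The sketch below takes a hypothetical $T \times K$ array of per-split risk estimates as input; how ties are broken (lowest index, via `np.argmin`) is an implementation choice, not something specified in the text.

```python
import numpy as np

def robust_weights(tau_hat):
    """w_k = share of splits in which predictor k has the smallest tau_hat."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    winners = np.argmin(tau_hat, axis=1)                  # delta_k for each split
    return np.bincount(winners, minlength=tau_hat.shape[1]) / tau_hat.shape[0]

# Made-up per-split risk estimates for K = 3 predictors over T = 4 splits.
tau_hat = [[1.0, 2.0, 3.0],
           [2.0, 1.0, 3.0],
           [1.0, 3.0, 2.0],
           [1.0, 2.0, 3.0]]
w_robust = robust_weights(tau_hat)
selected = int(np.argmax(w_robust))   # the majority vote over the same splits
```

The last line shows the contrast drawn in the text: the selector keeps only the arg max of these proportions, whereas the robust weights keep the proportions themselves. A predictor that never wins receives weight zero in this MC version.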

3.3. Illustration

Simulations below provide a simple illustration of design-based individual-level predictive inference. For better appreciation of the design-based approach, we include also the risk estimator under an assumed IID error model.

We generate 200 sets of $y_{U}$ of population size $N = 2000$ in an ad hoc manner. For each $y_{U}$, half of the values are generated by M1 below and the other half by M2, where $x_{1} \sim N (0, 1)$ and $x_{2} \sim Poisson (5)$,

\begin{aligned} (M1) \quad & y = x_{1} + 0.5 x_{2} + ϵ, \quad ϵ \sim \begin{cases} N (0, 1) & \text{if } z = 1 \Leftrightarrow x_{2} < 3 \\ N (- 2, 1) & \text{if } z = 2 \Leftrightarrow 3 \leq x_{2} < 7 \\ N (2, 1) & \text{if } z = 3 \Leftrightarrow x_{2} \geq 7 \end{cases} \\ (M2) \quad & y = 0.5 + 1.5 x_{1} + x_{2} + ϵ, \quad ϵ \sim z^{2} + N (0, 0.25), \quad z \sim N (0, 1) . \end{aligned}

From each population we draw a sample of size $n = 200$ by either SRSWOR or Poisson Sampling. For Poisson Sampling, we set $π_{i}^{- 1} \propto 1 + 1 / \exp (α + 0.5 y_{i})$ and $\sum_{i \in U} π_{i} = n$, where $α \in \{ 1, - 0.1, - 1 \}$ yields a coefficient of variation of $π_{i}$ over $U$, denoted by ${cv}_{π}$, of about 15%, 30%, and 45%, respectively. This illustrates a situation where sample selection may cause issues for uncertainty assessment by the IID model.
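The informative Poisson design can be sketched as follows. Since $1 + 1 / \exp (u) = 1 + \exp (- u)$, the inclusion probability is proportional to the logistic function of $u = α + 0.5 y_{i}$, rescaled to sum to $n$. The outcome values below are placeholders rather than draws from M1 and M2, and the function name is hypothetical, so the resulting ${cv}_{π}$ need not match the figures quoted in the text.

```python
import numpy as np

def poisson_inclusion_probs(y, n, alpha):
    """pi_i with 1 / pi_i proportional to 1 + exp(-(alpha + 0.5 * y_i)) and sum(pi) = n."""
    g = 1.0 / (1.0 + np.exp(-(alpha + 0.5 * y)))   # proportional to pi_i
    pi = n * g / g.sum()
    if pi.max() > 1.0:
        raise ValueError("rescaling produced an inclusion probability above 1")
    return pi

rng = np.random.default_rng(2)
N, n = 2000, 200
y = rng.normal(3.0, 2.0, size=N)                   # placeholder outcomes, not M1/M2
pi = poisson_inclusion_probs(y, n, alpha=-1.0)
cv_pi = pi.std() / pi.mean()                       # coefficient of variation over U
sample = np.flatnonzero(rng.random(N) < pi)        # independent Bernoulli(pi_i) draws
```

Under Poisson sampling the realized sample size `len(sample)` is random with expectation $n$, which is one reason the fixed-size SRSWOR formulas do not carry over directly.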

Let an order-3 model ensemble contain linear regression, random forest, and support vector machine. Let the feature vector be $x = (x_{1}, x_{2})$ in all the cases. We use a 70-30 random split for subsampling of $(s_{1}, s_{2})$ and let $T = 50$ for relevant MC-SRB operations such as Equation (13). We obtain thus the SRB-predictor as described in Equation (5) corresponding to each model.

For each SRB predictor, we estimate its standardized risk, Equation (10), $τ / | R |$ , as described before, where we have $π_{2 i} \equiv n_{2} / (N - n_{1})$ under SRSWOR given $n_{1} = | s_{1} |$ and $n_{2} = | s_{2} |$ , and we use $ϕ_{2 i}$ given by Equation (8) instead of $π_{2 i}$ under Poisson Sampling. Note that if $\hat{τ}$ is unbiased for $τ$ over repeated sampling from a given population, then it is also unbiased for $D / | R |$ over all the 200 populations.

The average of the 200 true $D / | R |$ for each SRB predictor will be referred to as average true MSE in the results below, which is given by

MSE_{true} = \frac{1}{200} \sum_{b = 1}^{200} \frac{1}{N - n} \sum_{i \in R^{(b)}} \{ \tilde{μ} (x_{i}, s^{(b)}) - y_{i} \}^{2}

where $s^{(b)}$ denotes the $b$-th simulated sample and $R^{(b)}$ the corresponding out-of-sample units. The proposed MSE estimator, Equation (13), will be compared to two MSE estimators that rely on the IID error model (e.g., James et al. 2013): the residual-based estimator and the cross-validation (CrV) estimator. The averages of the latter two estimators over the 200 populations are given as

\begin{aligned} MSE_{resid} &= \frac{1}{200} \sum_{b = 1}^{200} \frac{1}{n} \sum_{i \in s^{(b)}} \{ \tilde{μ} (x_{i}, s^{(b)}) - y_{i} \}^{2} \\ MSE_{CrV} &= \frac{1}{200 T} \sum_{b = 1}^{200} \sum_{t = 1}^{T} \frac{1}{n_{2}} \sum_{i \in s_{2}^{(b, t)}} \{ μ (x_{i}, s_{1}^{(b, t)}) - y_{i} \}^{2} \end{aligned}

where $s^{(b)} = s_{1}^{(b, t)} \cup s_{2}^{(b, t)}$ signifies the $t$ -th subsampling of the $b$ -th sample.

Table 2 displays the average true MSE and its estimates across the simulation settings. Regardless of the model, the proposed design-based risk estimator Equation (13) is unbiased under the SRSWOR $p$-design, where $π_{2 i}$ is known. Using $ϕ_{2 i}$ instead of $π_{2 i}$ under informative Poisson sampling, it remains essentially unbiased when ${cv}_{π} = 15 \%$ or $30 \%$, but the approximation appears to cause some underestimation as ${cv}_{π}$ increases to 45%, where the severest underestimate is $(9.288 - 9.866) / 9.866 \times 100 = - 5.86 \%$.

Table 2.

MSE and Estimates Given Each Model, Averaged Over 200 Simulations.

MSE	SRSWOR			PS $({c v}_{π} = 15 %)$
	LR	RF	SVM	LR	RF	SVM
Average, true	8.399	9.013	9.272	8.566	9.225	9.671
Design, proposed	8.409	9.073	9.326	8.416	9.182	9.615
Model, CrV	8.457	9.481	9.862	8.014	9.214	9.405
Model, residual	8.162	5.105	7.706	7.766	4.945	7.578
MSE	PS $({c v}_{π} = 30 %)$			PS $({c v}_{π} = 45 %)$
	LR	RF	SVM	LR	RF	SVM
Average, true	8.957	9.726	10.451	9.866	10.884	11.573
Design, proposed	8.711	9.559	10.196	9.288	10.364	10.974
Model, CrV	7.624	8.880	8.799	6.992	8.262	7.933
Model, residual	7.369	4.731	7.330	6.776	4.367	6.758

Note. PS = Poisson sampling; LR = linear regression; RF = random forest; SVM = support vector machine.
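The relative biases discussed for Table 2 can be recomputed from the tabulated averages. The snippet below uses only the values in the PS $({cv}_{π} = 45 \%)$ panel for the true MSE and the proposed design-based estimator; the variable names are illustrative.

```python
# Table 2, PS (cv_pi = 45%): (average true MSE, proposed design-based estimate)
panel_45 = {
    "LR": (9.866, 9.288),
    "RF": (10.884, 10.364),
    "SVM": (11.573, 10.974),
}

# Relative bias in percent: 100 * (estimate - true) / true.
rel_bias_pct = {
    model: 100.0 * (est - true) / true for model, (true, est) in panel_45.items()
}
worst_model = min(rel_bias_pct, key=rel_bias_pct.get)   # most negative relative bias
```

The same formula applied to the CrV and residual rows of the table gives the corresponding relative biases of the IID-model estimators.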

The CrV-based IID-model MSE estimator is also essentially unbiased under SRSWOR, because the out-of-bag squared errors in the test sample $s_{2}$ have the same mean as those in $R$ conditional on $s_{1}$ under the $pq$ -design. This is reasonable since the IID error model would hold exactly under SRS with replacement. As mentioned in Subsection 2.2, Bates et al. (2024) explain why the CrV-based MSE estimator is not exactly unbiased for the full-sample once-trained predictor $μ (x, s)$ , even when the IID model is correct. The CrV-based MSE estimator is instead applied to the SRB-predictor $\bar{μ} (x, s)$ here.

The CrV-based MSE estimator can become severely biased though, if the IID model does not hold for the actual sample selection mechanism, as illustrated here for Poisson Sampling with increasing ${cv}_{π}$ . Furthermore, residual-based MSE estimation should be avoided even under SRSWOR $p$ -design, since it generally leads to large biases for predictors derived from highly flexible machine learning models such as random forest and support vector machine.

Next, to illustrate inference for ensemble individual prediction, we obtain the SRB-predictor Equation (5) selected by Equation (14), and the two mixed SRB-predictors using the weights that are either optimal for Equation (15) or robust by Equation (16). Given any ensemble MC-SRB predictor $\tilde{μ} (x, s^{(b)})$, for $b = 1, \dots, 200$, we calculate its design-based risk estimator using Equation (13), following the corresponding description in Subsection 3.2. The two IID model-based MSE estimators are calculated in the same way as given above.

Table 3 presents the average true MSE and estimates for the SRB-selector and mixed SRB-predictors across the simulations. The average true MSE of the SRB-selector is similar to that of the linear model in Table 2, which has a smaller MSE than random forest or support vector machine. The two mixed SRB-predictors achieve largely the same true MSE for individual prediction in each simulation setting, illustrating the robustness of the ensemble prediction approach even when it cannot improve on the best single model in the given ensemble. In terms of MSE estimation, the results in Table 3 are consistent with what we have observed for Table 2.

Table 3.

MSE and Estimates for Ensemble Individual Prediction, Averaged Over 200 Simulations; Predictor Selected by Majority-Vote, or Averaged by Optimal or Robust Mixing Weights.

MSE	SRSWOR			PS $({c v}_{π} = 15 %)$
	Selected	Optimal	Robust	Selected	Optimal	Robust
Average, true	8.432	8.367	8.380	8.570	8.558	8.578
Design, proposed	8.395	8.260	8.284	8.412	8.343	8.372
Model, CrV	8.453	8.341	8.374	8.015	7.981	8.009
Model, residual	8.076	7.264	7.178	7.746	7.146	7.008
MSE	PS $({c v}_{π} = 30 %)$			PS $({c v}_{π} = 45 %)$
	Selected	Optimal	Robust	Selected	Optimal	Robust
Average, true	8.980	8.985	9.012	9.897	9.915	9.981
Design, proposed	8.700	8.657	8.691	9.281	9.257	9.316
Model, CrV	7.631	7.618	7.647	7.006	6.997	7.035
Model, residual	7.289	6.902	6.667	6.989	6.977	7.008

Note. PS = Poisson sampling.

4. Illustrative Application

Here we describe an illustrative application of design-based prediction estimation for the Spanish Structural Business Survey (SSBS). The SSBS provides information about the main structural and economic characteristics of businesses, such as employed personnel, turnover, purchases, personnel expenses, taxes, and investments. The target population consists of businesses classified in one of the following economic sectors: industrial sector, commercial sector, and services sector.

We take the year 2020 for reference, where the SSBS population contained 2,615,811 business units and the estimated total turnover was 1,785 billion euros, which is related to the total value of market sales of goods and services to third parties during the reference year. The SSBS estimation is traditionally based on the HT-estimator. One of the motivations of this study, which is directly related to the SSBS, is the need to investigate whether it is possible to reduce the SSBS sample size by developing and introducing more efficient estimation approaches.

4.1. $pq$ -Design for SSBS

The SSBS sample contains both fully surveyed business units and other units that are mainly imputed. For our purpose here, we shall only consider the sample of fully surveyed units, which are selected using a stratified random sampling design. There is as usual an exhaustive (i.e., take-all) stratum for the largest businesses, which will be excluded from the application below, since sampling error does not exist there. In addition, any stratum with only one or two sample units will be removed, because some variance smoothing techniques would be needed for these strata in practice, which have no direct relevance to the theory of design-based predictive inference.

A total of 9,681 strata are retained in this way, which contain altogether 2,018,561 population units. As shown in the top plot of Figure 2, the stratum population size is relatively small for most strata but has a skewed distribution. The biggest stratum has 54,770 units and there are 319 strata with more than one thousand units. The histogram of stratum sample size is given in the bottom plot of Figure 2. About half of all the strata have five or fewer sample units, whereas the number of strata with sample size greater than 25 is 339. The total sample size is 80,280.

Figure 2.

Histogram of stratum size in population (top) and sample (bottom).

To investigate the potential for sample size reduction, we randomly selected, without replacement, 45% of the original sample units in each stratum, subject to a minimum of three sample units in each stratum. The resulting total sample size is 40,514, which is about half of the original sample size. Alternative model-assisted estimators (to be described below) will be compared based on this reduced sample, which demonstrates both the proposed inference approach and the potential for sample reduction.

We notice that although the ad hoc reductions of stratum sample sizes above should not be taken as a proposal for the new SSBS sampling design, the estimation results based on this realized sample are more “tangible” than the alternative, whereby one first estimates the MSE of a given estimator based on the original sample and then speculates how this MSE might have changed had the total sample size been reduced by 50%. Moreover, insofar as our aim here is not the specifics of the new SSBS sampling design, the single realized sample is large enough to warrant the comparison of different estimation approaches, and there is no need to simulate the sample reduction above many times which would only have generated largely similar results.

We adopt within-stratum SRSWOR as the subsampling $q$-design for any SRB-based estimator, in addition to the stratified random sampling $p$-design above. A constant subsampling rate will be implemented in all the strata, where the stratum subsample sizes of $(n_{1}, n_{2})$ are truncated to integers by the floor function, subject to the constraint that $n_{2} \geq 2$ in each stratum, such that variance estimation is feasible. Although other subsampling $q$-designs and stratum-specific subsampling rates can be explored, a systematic investigation in this respect is beyond the scope of this paper.
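One possible reading of the stratum-level rule is sketched below. It assumes that the training size $n_{1}$ is the floored quantity, that the training share is supplied as an integer percentage (to avoid floating-point surprises in the floor), and that at least one unit is kept for training. The text does not spell out these details, so the function `split_sizes` is a hypothetical illustration.

```python
def split_sizes(n_h, train_pct):
    """Return (n1, n2) for a stratum with n_h sample units.

    n1 = floor(n_h * train_pct / 100), adjusted so that n2 >= 2 and n1 >= 1.
    """
    if n_h < 3:
        raise ValueError("strata with fewer than three sample units are excluded")
    n1 = max(1, (n_h * train_pct) // 100)   # integer floor of the training size
    n2 = n_h - n1
    if n2 < 2:                              # enforce the variance-estimation constraint
        n2 = 2
        n1 = n_h - 2
    return n1, n2

# Strata of size three always end up with a (1, 2) split under an 80-20 rate.
sizes = [split_sizes(n_h, 80) for n_h in (3, 5, 10, 100)]
```

With a minimum stratum sample size of three, the constraint $n_{2} \geq 2$ means that the smallest strata train on a single within-stratum unit, which is workable here only because the models are fitted globally rather than stratum by stratum.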

4.2. Models and Estimators

We use turnover as the target variable for illustration. The HT-estimators based on the original SSBS sample (with 80,280 units) and the reduced sample (with 40,514 units) provide the baselines for comparison. Four additional estimators will be applied to the reduced sample, which arise from the $2 \times 2$ combination of models (linear, tree) and estimators (prediction, unbiased) as described below.

The model is either linear or tree regression, defined globally regardless of the design strata. The linear model uses four features:

• the “administrative” turnover from the corporate income tax if available, or imputed stratum-mean (of available administrative turnover) if missing;

• a binary indicator for whether the administrative turnover is missing;

• the operating income according to the tax administration if available, or imputed stratum-mean (of available operating income) if missing;

• a binary indicator for whether the operating income is missing.

The linear regression coefficients are estimated by weighted ridge regression, given the sampling design weights and a small tuning parameter $λ = 0.01$ for the regularization penalty.
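A weighted ridge fit of this kind has the closed form $\hat{β} = (X^{T} W X + λ I)^{-1} X^{T} W y$, where $W$ is the diagonal matrix of design weights. The sketch below implements that formula under assumptions the text does not settle: the penalty is applied to every coefficient (including any intercept column) and the features are not standardized. The data are synthetic.

```python
import numpy as np

def weighted_ridge(X, y, w, lam=0.01):
    """Solve (X' W X + lam * I) beta = X' W y with W = diag(w)."""
    Xw = X * w[:, None]                         # rows of X scaled by the design weights
    A = X.T @ Xw + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, Xw.T @ y)

rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
w = rng.uniform(5.0, 50.0, size=n)              # made-up sampling weights
beta_ridge = weighted_ridge(X, y, w, lam=0.01)
```

With such a small $λ$ relative to the weighted cross-products, the penalty mainly guards against near-singularity, for instance when a missingness indicator is almost constant.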

The tree regression model uses the following four features, where the missing values are not imputed but left as-is to the software package:

• the administrative turnover,

• the operating income,

• the first digit of the National Classification of Economic Activities,

• the number of employees according to the Business Register.

The tree model is built using the ready-made R package h2o for random forest, where one feature is chosen for each split (mtries = 1), the maximum tree depth is 20 (max_depth = 20) and the minimum number of observations per leaf is 5 (min_rows = 5). The observations are again weighted by the sampling weights.

Given either model, we consider the SRB prediction estimator Equation (4) and the design-unbiased model-assisted SRB-estimator (Sanguiao-Sande and Zhang 2021), where the latter can be given as ${\hat{Y}}_{M} = E_{q} ({\hat{Y}}_{1 M} | s)$ and

{\hat{Y}}_{1 M} = \sum_{i \in s_{1}} y_{i} + \sum_{i \in U \setminus s_{1}} μ (x_{i}, s_{1}) + \sum_{i \in s_{2}} (π_{2 i}^{- 1} - 1) \{ μ (x_{i}, s_{1}) - y_{i} \}

The SRB operation uses a constant 80-20 sample-split for the linear model and a constant 50-50 sample-split for the tree model. Notice that in the case of tree model, the SRB-predictor Equation (5) is a random forest by construction since it is then the average of $T$ randomly constructed trees.

It is worth pointing out that, unlike the prediction estimator Equation (4) that applies $\bar{μ} (x, s)$ directly to all the out-of-sample units, the SRB-estimator ${\hat{Y}}_{M}$ corrects the bias of each $μ (x, s_{1})$ using the observed errors $\{ μ (x_{i}, s_{1}) - y_{i} : i \in s_{2} \}$ via $π_{2 i}$; see Sanguiao-Sande and Zhang (2021) for the details. Although ${\hat{Y}}_{M}$ is thus exactly design-unbiased, it may have a larger MSE than the prediction estimator Equation (4) that uses the same model. Conversely, a prediction estimator would become less attractive if its bias is ‘intolerable’, even though it may have a much smaller MSE than an unbiased estimator that uses the same model.

4.3. Results

Table 4 summarizes the results for the estimators described above. In addition to each estimate $\hat{Y}$, it gives the estimated bias (zero for an unbiased estimator), the estimated MSE, the relative error (RErr) given as $\sqrt{MSE} / \hat{Y}$, and the Monte Carlo (MC) error of the MSE estimator.

Table 4.

Estimation Results (in Billion Euros), $T = 10{,}000$ Sample-Splits, Based on Same Reduced Sample Size Unless Indicated Otherwise.

Estimator, model	$\hat{Y}$	Bias	MSE	RErr	MC error
HT-estimator (full sample size)	258	0	94	0.04	—
HT-estimator	252	0	151	0.05	—
SRB-prediction estimator, linear	227	−2	50	0.03	3
SRB-prediction estimator, tree	238	4	27	0.02	5
SRB-estimator, linear	229	0	122	0.05	1
SRB-estimator, tree	234	0	107	0.04	2
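The derived quantities in Table 4 can be reproduced from its own entries. The snippet below recomputes RErr as $\sqrt{MSE} / \hat{Y}$, the MSE of each estimator relative to the reduced-sample HT-estimator, and the relative MC error; the dictionary layout and names are illustrative.

```python
from math import sqrt

# Table 4 entries: estimator -> (Y_hat, bias, MSE, MC error or None)
table4 = {
    "HT (full sample size)": (258, 0, 94, None),
    "HT": (252, 0, 151, None),
    "SRB-prediction, linear": (227, -2, 50, 3),
    "SRB-prediction, tree": (238, 4, 27, 5),
    "SRB-estimator, linear": (229, 0, 122, 1),
    "SRB-estimator, tree": (234, 0, 107, 2),
}

# Relative error, rounded to two decimals as in the table.
rerr = {k: round(sqrt(mse) / y_hat, 2) for k, (y_hat, _, mse, _) in table4.items()}

# MSE relative to the HT-estimator based on the same reduced sample.
mse_vs_ht = {k: mse / table4["HT"][2] for k, (_, _, mse, _) in table4.items()}

# Relative MC error of the MSE estimator, where an MC error is reported.
rel_mc = {k: mc / mse for k, (_, _, mse, mc) in table4.items() if mc is not None}
```

For instance, `mse_vs_ht["SRB-prediction, linear"]` is $50 / 151$, roughly one third, and `rel_mc["SRB-prediction, tree"]` is $5 / 27$.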

Exact Rao-Blackwellisation for MSE estimation is simply beyond reach in this case, where we have more than six thousand strata with sample size three in the reduced sample. If we leave out two units in each stratum under subsampling, then there are more than $3^{6000}$ distinct samples under the $q$ -design just for these strata, because the models are not built separately within each stratum, not to mention the other strata with more sample units.
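To give a sense of scale, the lower bound can be evaluated directly. The snippet assumes exactly 6,000 strata of sample size three, which is a round figure standing in for "more than six thousand".

```python
from math import comb

per_stratum = comb(3, 2)                  # ways to leave out two of three units
n_strata = 6000                           # round figure used for illustration
n_splits_lower_bound = per_stratum ** n_strata
digits = len(str(n_splits_lower_bound))   # number of decimal digits in the bound
```

Even this lower bound has thousands of decimal digits, so enumerating all splits is out of the question.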

The MC error is the bootstrap estimated standard deviation of the MC-MSE estimator, which is due to the loss of MSE estimation efficiency by Monte Carlo compared to exact Rao-Blackwellisation (with zero MC error). It can be seen that the relative MC error still does not vanish despite the large number of sample-splits $T = 10{,}000$; for example, it is $3 / 50$ for the linear-model SRB-prediction estimator. Not surprisingly, the relative MC error can increase as the MSE decreases, such as $5 / 27$ for the tree-model SRB-prediction estimator. Nevertheless, the MSE estimation results here are reliable enough for us to distinguish between the different estimators.

First, we notice that the two design-unbiased SRB-estimators have only led to moderate efficiency gains over the HT-estimator. The main reason is likely to be the large number of strata with very few sample units. Basically, in case the SRB-estimator uses $μ (x)$ that is given in advance, it would become a stratified difference estimator, which corrects for the design-based bias of $μ (x)$ for each stratum population total by using only the within-stratum sample units. The situation is largely similar for the SRB-estimator using $μ (x, s)$ estimated based on the whole sample, which may have a small sampling variance itself.

In comparison, the SRB-prediction estimators using the same models can be much more efficient precisely because they do not apply bias correction. That is, they are no longer stratified estimators as the SRB-estimators are, so they can take full advantage of the reduced variance if the prediction biases are small. The MSE of the linear-model SRB-prediction estimator is only about one third of the sampling variance of the HT-estimator at the same sample size (50 against 151), and the tree model reduces the MSE by nearly half again compared to the linear model (27 against 50). However, while the bias of the linear-model prediction estimator is relatively small compared to its root MSE (2 against $\sqrt{50}$), this is no longer the case for the tree-model prediction estimator, that is, 4 against $\sqrt{27}$.
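The comparisons in this paragraph follow directly from the reported figures. The short sketch below recomputes them; the dictionary keys and helper names are illustrative, and the only inputs are the bias and MSE values quoted in the table and text.

```python
import math

# Bias and MSE as reported for the reduced sample.
reported = {
    "HT": {"bias": 0.0, "mse": 151.0},
    "SRB-prediction, linear": {"bias": -2.0, "mse": 50.0},
    "SRB-prediction, tree": {"bias": 4.0, "mse": 27.0},
}


def bias_to_rmse(entry):
    """Absolute bias relative to root MSE."""
    return abs(entry["bias"]) / math.sqrt(entry["mse"])


def squared_bias_share(entry):
    """Share of the MSE due to squared bias (MSE = variance + bias^2)."""
    return entry["bias"] ** 2 / entry["mse"]


linear = reported["SRB-prediction, linear"]
tree = reported["SRB-prediction, tree"]

mse_ratio_linear_vs_ht = linear["mse"] / reported["HT"]["mse"]
mse_ratio_tree_vs_linear = tree["mse"] / linear["mse"]

print(round(mse_ratio_linear_vs_ht, 3))    # 0.331
print(round(mse_ratio_tree_vs_linear, 2))  # 0.54
print(round(bias_to_rmse(linear), 3), round(bias_to_rmse(tree), 3))
print(round(squared_bias_share(linear), 3), round(squared_bias_share(tree), 3))
```

On these figures, squared bias accounts for 8% of the MSE of the linear-model estimator but about 59% for the tree-model estimator.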

This serves as a reminder that it is often possible to reduce the MSE at the cost of some increase in bias, as has traditionally been the case when adopting model-assisted or model-based estimators. The theory of design-based predictive inference allows one to estimate both the bias and the MSE of a large class of prediction estimators of the form in Equation (1). This widens the scope of choice in practice, so that a sensible trade-off between bias and variance can be achieved.

Finally, the illustrative results above suffice to demonstrate the potential for sample size reduction in the SSBS, although an appropriate scheme of stratum sample size reduction requires a more systematic investigation. In particular, the accompanying estimator can be chosen from the broad class of prediction estimators in Equation (1), assisted by any model or algorithm and by the various features available from the administrative source. A detailed analysis is needed, however, to take into account the level of dissemination (rather than just an overall total, as here) and the tolerable trade-off between bias and root MSE.

5. Final Remarks

We have developed a theory of design-based predictive inference from finite-population probability sampling. For population total estimation, one would be interested in the total of the out-of-sample prediction errors, whereas the risk of individual-level prediction depends on the out-of-sample squared prediction errors. The SRB approach provides a unified treatment of both.

Adopting design-based predictive inference for official statistics allows one to circumvent the design versus model controversy. In addition to producing population-level estimates, it provides a theoretical basis for producing statistical registers or census-like data for descriptive official statistics. The theory we propose allows for any assisting ML models or algorithms, which can be more efficient than calibration estimation using only auxiliary totals or the parametric assisting models commonly used in survey sampling.

There are a number of issues worth further investigation, of which we mention only a few here. First, survey nonresponse is unavoidable in practice. Lee et al. (2023) apply a related SRB ensemble learning approach to missing data imputation. It would be helpful to develop a unified SRB approach that can incorporate survey response under an extended quasi-randomisation framework. Next, choices of risk other than the total squared prediction error may be considered for individual prediction, and interval estimation may be developed for population total inference wherever the design-based bias of the prediction estimator is deemed non-negligible. Finally, it is worth studying how best to balance the risk of individual prediction against the MSE of population total estimation associated with the prediction estimator in Equation (1).

Acknowledgements

We thank three anonymous referees and the Associate Editor for comments that have helped to sharpen our message.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Li-Chun Zhang

Danhyang Lee

Received: November 2023

Accepted: August 2024

References

1. Bates; Hastie; Tibshirani. 2024. “Cross-Validation: What Does It Estimate and How Well Does It Do It?” Journal of the American Statistical Association 119: 1434–45. DOI: https://doi.org/10.1080/01621459.2023.2197686.

2. Beaumont, J.-F.; Haziza. 2022. “Statistical Inference from Finite Population Samples: A Critical Review of Frequentist and Bayesian Approaches.” The Canadian Journal of Statistics 50: 1186–212. DOI: https://doi.org/10.1002/cjs.11717.

3. Berger, Y. G.; De La Riva Torres. 2016. “Empirical Likelihood Confidence Intervals for Complex Sampling Designs.” Journal of the Royal Statistical Society Series B: Statistical Methodology 78: 319–41. DOI: https://doi.org/10.1111/rssb.12115.

4. Blackwell. 1947. “Conditional Expectation and Unbiased Sequential Estimation.” Annals of Mathematical Statistics 18: 105–110. DOI: https://doi.org/10.1214/aoms/1177730497.

5. Breidt, F. J.; Opsomer, J. D. 2017. “Model-Assisted Survey Estimation with Modern Prediction Techniques.” Statistical Science 32: 190–205. DOI: https://doi.org/10.1214/16-STS589.

6. Deville, J.-C.; Särndal, C.-E. 1992. “Calibration Estimators in Survey Sampling.” Journal of the American Statistical Association 87: 376–82. DOI: https://doi.org/10.1080/01621459.1992.10475217.

7. Deville, J.-C.; Tillé. 2004. “Efficient Balanced Sampling: The Cube Method.” Biometrika 91: 893–912. DOI: https://doi.org/10.1093/biomet/91.4.893.

8. Dietterich, T. G. 2000. “Ensemble Methods in Machine Learning.” In International Workshop on Multiple Classifier Systems, edited by Dietterich, T. G., 1–15. Berlin, Heidelberg: Springer.

9. Dong; Cao; Shi. 2020. “A Survey on Ensemble Learning.” Frontiers of Computer Science 14: 241–58. DOI: https://doi.org/10.1007/s11704-019-8208-z.

10. Fisher, R. A. 1956. Statistical Methods and Scientific Inference. Edinburgh and London: Oliver and Boyd.

11. Geisser. 1993. Predictive Inference. New York, NY: Chapman & Hall.

12. Hansen, M. H. 1987. “Some History and Reminiscences on Survey Sampling.” Statistical Science 2: 180–90. DOI: https://doi.org/10.1214/ss/1177013352.

13. Hansen, M. H.; Madow, W. G.; Tepping, B. J. 1983. “An Evaluation of Model-Dependent and Probability-Sampling Inferences in Sample Surveys.” Journal of the American Statistical Association 78: 776–93. DOI: https://doi.org/10.1080/01621459.1983.10477018.

14. Hartley, H. O.; Rao, J. N. K. 1968. “A New Estimation Theory for Sample Surveys.” Biometrika 55: 547–57. DOI: https://doi.org/10.1093/biomet/55.3.547.

15. Hartley, H. O.; Ross. 1954. “Unbiased Ratio Estimators.” Nature 174: 270–1. DOI: https://doi.org/10.1038/174270a0.

16. Horvitz, D. G.; Thompson, D. J. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47: 663–85. DOI: https://doi.org/10.1080/01621459.1952.10483446.

17. James; Witten; Hastie; Tibshirani. 2013. An Introduction to Statistical Learning. New York, NY: Springer.

18. Kalton. 2002. “Models in Practice of Survey Sampling.” Journal of Official Statistics 18: 129–54.

19. Lee; Zhang, L.-C.; Chen. 2023. “Robust Quasi-Randomization-Based Estimation with Ensemble Learning for Missing Data.” Scandinavian Journal of Statistics 50: 1263–78. DOI: https://doi.org/10.1111/sjos.12626.

20. Mickey, M. R. 1959. “Some Finite Population Unbiased Ratio and Regression Estimators.” Journal of the American Statistical Association 54: 594–612. DOI: https://doi.org/10.1080/01621459.1959.10501523.

21. Neyman. 1934. “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection.” Journal of the Royal Statistical Society 97: 558–625. DOI: https://doi.org/10.2307/2342192.

22. Rao, C. R. 1945. “Information and Accuracy Attainable in the Estimation of Statistical Parameters.” Bulletin of Calcutta Mathematical Society 37: 81–91. DOI: https://doi.org/10.1007/978-1-4612-0919-5_15.

23. Rao, J. N. K. 2005. “Interplay Between Sample Survey Theory and Practice: An Appraisal.” Survey Methodology 31: 117–38. URL: https://www150.statcan.gc.ca/n1/pub/12-001-x/2005002/article/9040-eng.pdf.

24. Rao, J. N. K. 2011. “Impact of Frequentist and Bayesian Methods on Survey Sampling Practice: A Selective Appraisal.” Statistical Science 26: 240–56. DOI: https://doi.org/10.1214/10-STS346.

25. Rao, J. N. K. 2010. “Bayesian Pseudo-Empirical-Likelihood Intervals for Complex Surveys.” Journal of the Royal Statistical Society Series B: Statistical Methodology 72: 533–44. DOI: https://doi.org/10.1111/j.1467-9868.2010.00747.x.

26. Sagi; Rokach. 2018. “Ensemble Learning: A Survey.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8: e1249. DOI: https://doi.org/10.1002/widm.1249.

27. Sanguiao-Sande; Zhang, L.-C. 2021. “Design-Unbiased Statistical Learning in Survey Sampling.” Sankhya A 83: 714–44. DOI: https://doi.org/10.1007/s13171-020-00224-1.

28. Särndal, C.-E.; Swensson; Wretman. 1992. Model-Assisted Survey Sampling. New York, NY: Springer.

29. Smith, T. M. F. 1983. “On the Validity of Inferences from Non-Random Sample.” Journal of the Royal Statistical Society: Series A (General) 146: 394–403. DOI: https://doi.org/10.2307/2981454.

30. Smith, T. M. F. 1994. “Sample Surveys 1975–1990; An Age of Reconciliation? (with Discussion).” International Statistical Review 62: 5–34. DOI: https://doi.org/10.2307/1403539.

31. Valliant; Dorfman, R. M.; Royall, R. M. 2000. Finite Population Sampling and Inference: A Prediction Approach. New York, NY: Wiley.

32. Sitter, R. R. 2001. “A Model-Calibration Approach to Using Complete Auxiliary Information from Survey Data.” Journal of the American Statistical Association 96: 185–93. DOI: https://doi.org/10.1198/016214501750333054.

33. Zhou, Z. H. 2012. Ensemble Methods: Foundations and Algorithms. Hoboken, NJ: CRC Press.

Design-Based Predictive Inference
