Prediction-Powered Estimation: Unbiased Model-Assisted Estimation

Abstract

National statistical agencies increasingly face budget constraints and shrinking sample sizes, while simultaneously gaining access to rich auxiliary data and powerful pre-trained machine learning (ML) and artificial intelligence (AI) models, including Large Language Models (LLMs). Traditional model-assisted estimation techniques, which fit models using survey sample data, are limited by small sample sizes, struggle to leverage complex non-linear relationships in auxiliary data, and cannot accommodate frontier pre-trained models. This work re-examines the use of pre-trained black-box models, fit independently of the survey sample, for design-based parameter estimation. Inspired by the Prediction-Powered Inference (PPI) framework, we introduce the Prediction-Powered Estimator (PPE), an unbiased estimator with an unbiased variance estimator for the survey design setting. We also formalize the use of pre-trained models with the classic difference estimator—which we term the Prediction-Powered Difference (PPD) estimator—and with the Generalized Regression Estimator via predicted values as covariates ( ${GREG}_{\hat{y}}$ ). Through LLM-based use-cases leveraging unstructured auxiliary data (images and text) and experiments with real-world survey data from Statistics Canada, complemented by simulation studies in the Supplemental Material, we demonstrate that these approaches consistently outperform standard baseline estimators across bias, mean absolute error, mean squared error, coverage, and confidence interval width. The results suggest that pre-trained models can yield more accurate and efficient estimates while potentially reducing survey sample sizes and respondent burden, and motivate expanding the survey methodologist’s toolbox to include pre-trained models and novel auxiliary data sources.

Keywords

estimation, survey sampling, model-assisted, machine learning, difference estimator

1. Introduction

Machine learning (ML) and artificial intelligence (AI) provide increasingly accurate prediction and imputation tools, with applications ranging from administrative data linkage to automated coding systems at national statistical agencies. At the same time, agencies have access to increasingly rich auxiliary data—administrative records, scanner data, images, and text. This convergence presents an opportunity to enhance survey estimation through pre-trained models fit on large external or historical datasets.

Our objectives are to (i) reduce sample sizes and respondent burden without sacrificing accuracy, (ii) accommodate useful non-linear, algorithmic models, including LLMs, and (iii) retain design-based unbiasedness of both estimators and variance estimators. Classical model-assisted methods such as the generalized regression (GREG) and model-calibrated (MC) estimators meet some, but not all, of these goals: they require fitting a (usually linear) model to the survey sample itself, risking overfitting as sample sizes shrink and precluding the use of large pre-trained black-box models.

A natural alternative is to rely on pre-trained models fitted on historical or external data. These models, independent of the current survey sample, can capture complex relationships unreachable with in-sample modeling. The chief concern is distribution shift between training and target populations. Motivated by the Prediction-Powered Inference (PPI) framework, we treat the pre-trained model as a black box and develop three complementary estimators: (1) the Prediction-Powered Estimator (PPE), a PPI-style design-based estimator with an unbiased variance formula; (2) the Prediction-Powered Difference (PPD) estimator, a difference estimator that uses model predictions; and (3) ${GREG}_{\hat{y}}$ , the standard GREG using the model’s predicted values ${\hat{y}}_{i} = f (x_{i})$ as a single covariate.

Each approach offers distinct strengths. The difference estimator is simple and design-unbiased (Breidt and Opsomer 2017); PPE is unbiased and may gain efficiency through a negative covariance term; ${GREG}_{\hat{y}}$ is robust to distributional shifts because the regression corrects systematic prediction errors. This paper introduces PPE as an extension of PPI to the survey setting, re-frames the difference estimator under the PPD label to encourage adoption with pre-trained models, benchmarks all three against classical baselines in simulation and real-world experiments, and demonstrates valid, efficient inference under moderate distributional shift.

The remainder of the paper is organized as follows: Section 2 reviews classical estimators and PPI. Section 3 extends PPI to the survey design setting, defining the PPE estimator and its variance estimator, and formalizes all three proposed estimation strategies. LLM-based use-cases are presented in Section 4, followed by experiments on real-world survey data from Statistics Canada in Section 5. Simulation experiments benchmarking the proposed estimators against classical baselines are provided in the Supplemental Material. Section 6 concludes.

2. Background

Survey methodologies aim to estimate population characteristics such as the total $t_{y} = \sum_{i \in U} y_{i}$ or mean $\bar{y} = t_{y} / N$ for a finite population of size $N$, with units indexed $i \in U = \{1, \dots, N\}$. We consider measurable designs with fixed sample size $n < N$, realized sample $s$, first-order inclusion probabilities $π_{i} = E [1_{i \in s}] > 0$, and second-order probabilities $π_{ij} = E [1_{i \in s} 1_{j \in s}] > 0$, for all $i, j \in U$. Scalars are lower-case, vectors bold lower-case, and matrices bold upper-case.

2.1. Horvitz-Thompson Estimator

In the absence of auxiliary data, the Horvitz-Thompson (HT) estimator of the population mean is ${\hat{μ}}_{HT} = \frac{1}{N} \sum_{i \in s} \frac{y_{i}}{π_{i}}$ . The HT estimator is unbiased, with unbiased variance estimator

\hat{V} ({\hat{μ}}_{HT}) = \frac{1}{N^{2}} \sum_{k, ℓ \in s} \frac{Δ_{k ℓ}}{π_{k ℓ}} \frac{y_{k}}{π_{k}} \frac{y_{ℓ}}{π_{ℓ}}, \qquad Δ_{k ℓ} = π_{k ℓ} - π_{k} π_{ℓ},

making it a reliable basis for design-based inference (Särndal et al. 1992, 43), though it does not exploit auxiliary information.
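As a minimal sketch of the HT point and variance estimators above, the following Python functions evaluate both for an arbitrary measurable design. The function names, argument layout, and toy numbers are assumptions made for illustration; SRSWOR is used only because its inclusion probabilities have a simple closed form.

```python
def ht_mean_and_variance(y_s, pi, pi2, N):
    """Horvitz-Thompson mean estimate and its variance estimate.

    y_s : observed values for the sampled units
    pi  : first-order inclusion probabilities of the sampled units
    pi2 : joint inclusion probabilities, with pi2[k][k] == pi[k]
    N   : population size
    """
    n = len(y_s)
    mean_hat = sum(y_s[k] / pi[k] for k in range(n)) / N
    var_sum = 0.0
    for k in range(n):
        for l in range(n):
            delta = pi2[k][l] - pi[k] * pi[l]
            var_sum += delta / pi2[k][l] * (y_s[k] / pi[k]) * (y_s[l] / pi[l])
    return mean_hat, var_sum / N ** 2


def srswor_probabilities(n, N):
    """First- and second-order inclusion probabilities under SRSWOR."""
    pi = [n / N] * n
    off_diag = n * (n - 1) / (N * (N - 1))
    pi2 = [[pi[k] if k == l else off_diag for l in range(n)] for k in range(n)]
    return pi, pi2


# Toy example (made-up data): n = 4 units drawn from N = 10.
y_sample = [1.0, 2.0, 3.0, 4.0]
pi, pi2 = srswor_probabilities(len(y_sample), 10)
mean_hat, var_hat = ht_mean_and_variance(y_sample, pi, pi2, 10)
```

Under SRSWOR the double sum reduces to the familiar $(1 - n/N) s^{2} / n$, where $s^{2}$ is the sample variance, which gives a quick check on the implementation.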

2.2. Model-Assisted Estimators

Assume a (row) vector of auxiliary variables $x_{i} \in R^{m}$ is available for all $i \in U$ . The Generalized Regression (GREG) estimator (Särndal et al. 1992, 225) is given by:

{\hat{μ}}_{GREG} = \frac{1}{N} \sum_{i \in U} \hat{m} (x_{i}) + \frac{1}{N} \sum_{i \in s} \frac{y_{i} - \hat{m} (x_{i})}{π_{i}},

(1)

where $\hat{m} (x_{i}) = x_{i} \hat{β}$ is a linear model fit to the survey sample $\{(x_{i}, y_{i})\}_{i \in s}$. The GREG is asymptotically design-unbiased (Särndal et al. 1992, 225–38), with approximate variance estimated via weighted residuals (Särndal et al. 1992, 235). At small sample sizes, however, both the estimator and its variance estimator can be substantially biased.

The Model-Calibrated (MC) estimator of Wu and Sitter (2001) extends GREG to non-linear models $g (x, \hat{θ})$ fit on the survey sample, using calibration weights that satisfy $\sum_{i \in s} w_{i} g (x_{i}, \hat{θ}) = \sum_{i = 1}^{N} g (x_{i}, \hat{θ})$. The MC estimator takes the same imputation-plus-residual form as GREG with $g (x, \hat{θ})$ in place of the linear predictor, and its variance is similarly estimated from in-sample residuals. Under regularity conditions on $g$, Wu and Sitter (2001) show that the estimator and its variance estimator are asymptotically unbiased. However, these regularity conditions exclude important non-smooth model classes such as decision trees and random forests, and the reliance on in-sample residuals for variance estimation is susceptible to overfitting, leading to severely underestimated variance in practice (see simulation results in Supplemental Material).

2.3. Difference Estimator and Pre-Trained Models

For any values $z_{i}$ defined for all $i \in U$ and independent of the sample, the difference estimator (Cassel et al. 1976) is:

{\hat{μ}}_{DE} = \frac{1}{N} \sum_{i \in U} z_{i} + \frac{1}{N} \sum_{i \in s} \frac{y_{i} - z_{i}}{π_{i}} .

(2)

This estimator is design-unbiased, with unbiased variance estimator based on the residuals $y_{i} - z_{i}$ (Särndal et al. 1992, 223). The key insight is that $z_{i}$ may be any pre-specified values: in particular, predictions $z_{i} = f (x_{i})$ from a model $f$ fit on data independent of the survey sample. This is the setting we exploit here.

When a pre-trained model $f$ is used, we refer to this as the Prediction-Powered Difference (PPD) estimator:

{\hat{μ}}_{PPD} = \frac{1}{N} \sum_{i \in U} f (x_{i}) + \frac{1}{N} \sum_{i \in s} \frac{y_{i} - f (x_{i})}{π_{i}},

(3)

with unbiased variance estimator:

\hat{V} ({\hat{μ}}_{PPD}) = \frac{1}{N^{2}} \sum_{i \in s} \sum_{j \in s} (\frac{π_{ij} - π_{i} π_{j}}{π_{ij}}) (\frac{y_{i} - f (x_{i})}{π_{i}}) (\frac{y_{j} - f (x_{j})}{π_{j}}) .

(4)

The variance decreases as model residuals shrink: a high-quality pre-trained model yields an efficient estimator. If there is significant distributional shift between training and survey data (concept drift), residuals may be large, reducing efficiency—but the estimator and variance estimator remain unbiased. The magnitude of concept drift can be assessed diagnostically from the survey sample residuals.
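The PPD estimator and its variance estimator can be sketched in a few lines for the SRSWOR case, where Equation (4) reduces to the usual finite-population-corrected sample variance of the residuals. The function name, the synthetic population, and the deliberately biased predictor below are assumptions for illustration only, not the data or models used in this paper.

```python
import random


def ppd_srswor(y_s, f_s, f_U):
    """PPD estimate of the mean and its variance estimate under SRSWOR.

    y_s : observed responses for the sampled units
    f_s : model predictions for the same sampled units
    f_U : model predictions for every unit in the population
    """
    n, N = len(y_s), len(f_U)
    resid = [y - f for y, f in zip(y_s, f_s)]
    rbar = sum(resid) / n
    estimate = sum(f_U) / N + rbar          # imputation term + HT mean of residuals
    s2 = sum((r - rbar) ** 2 for r in resid) / (n - 1)
    var_hat = (1 - n / N) * s2 / n          # Equation (4) specialised to SRSWOR
    return estimate, var_hat


# Synthetic population (assumption): y depends on x, and the "pre-trained"
# predictor f is deliberately biased by +0.5 to mimic mild drift.
rng = random.Random(0)
N, n = 1000, 50
x = [rng.uniform(0, 10) for _ in range(N)]
y = [3 * xi + rng.gauss(0, 1) for xi in x]
f = [3 * xi + 0.5 for xi in x]

s = rng.sample(range(N), n)                 # one SRSWOR draw
ppd_est, ppd_var = ppd_srswor([y[i] for i in s], [f[i] for i in s], f)
# Setting every prediction to zero recovers the HT estimator.
ht_est, ht_var = ppd_srswor([y[i] for i in s], [0.0] * n, [0.0] * N)
true_mean = sum(y) / N
```

In this toy setting the residual variance is far smaller than the variance of $y$, so the estimated variance of PPD is well below that of HT even though the predictor is biased.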

More precisely, the variance should be understood as a conditional variance, conditioned on the survey design, the pre-trained model $f$ , the auxiliary data $X$ , the response values $y$ , and the design information $Z_{U}$ available prior to sampling: $\hat{V} ({\hat{μ}}_{PPD} ∣ f, Z_{U}, X, y)$ . For notational simplicity, we suppress these conditioning arguments throughout, with the understanding that all variance expressions are implicitly conditional on these elements—in particular on $f$ .

2.4. Recent Developments

Sanguiao-Sande and Zhang (2021) introduced a novel design-unbiased method known as subsampling Rao-Blackwellisation (SRB), which integrates modern machine learning (ML) techniques into the framework of design-based inference for finite populations. Their approach addresses the increasing use of complex, nonlinear ML models in survey sampling by maintaining design validity—ensuring inference remains grounded in the known sampling design, regardless of model correctness. The SRB method synthesizes three classical ideas in statistical inference: (i) model-assisted estimation, which uses auxiliary information to improve efficiency while preserving design-based properties; (ii) cross-validation, a standard technique in ML for estimating prediction error; and (iii) the Rao-Blackwell Theorem, which enhances estimator efficiency by conditioning on sufficient statistics. Unlike traditional model-assisted methods that rely on asymptotic design consistency, the SRB approach aligns with the Neyman-Fisher notion of consistency for finite populations, providing exact design-unbiasedness for a given population and sampling method while leveraging the predictive power of ML models. However, SRB requires fitting the model on subsets of the survey sample, limiting its applicability to complex or large model classes and making it computationally expensive.

Dagdoug et al. (2023) extended random forests to model-assisted estimation, showing asymptotic design-consistency. However, performance depends critically on the hyper-parameter $n_{0}$ (minimum terminal node size), and coverage targets were not reliably achieved across the range of settings studied.

These approaches share a key limitation relative to the PPD/PPE framework: they fit models using the survey sample, which is typically small and shrinking. Pre-trained models, by contrast, are fit on large external datasets and treat the survey sample only as the labeled set for bias-correction.

2.5. Prediction-Powered Inference (PPI)

PPI was introduced by Angelopoulos et al. (2023a) and quickly gained attention within the ML research community, with extensions including Bayesian PPI (Hofer et al. 2024), CrossPPI (Zrnic and Candès 2024b), PPI++ (Angelopoulos et al. 2024), and a generalization to active statistical inference (Zrnic and Candès 2024a). Despite its promise, no prior work has adapted PPI to the survey design setting.

PPI assumes access to a black-box model $f : X \to Y$, which can be linear, non-linear, and arbitrarily complex (e.g., a neural network, random forest, or LLM) and is fit on data independent of the estimation dataset. It also assumes access to two data sources: a labeled set $\{(x_{i}, y_{i})\}_{i \in s}$ (the survey sample) and an unlabeled auxiliary set outside $s$. The PPI estimator takes the form:

{\hat{μ}}_{ppi} = \underbrace{\frac{1}{| U ∖ s |} \sum_{i \notin s} f (x_{i})}_{{\hat{μ}}^{f} \text{ (imputation term)}} + \underbrace{\frac{1}{n} \sum_{i \in s} (y_{i} - f (x_{i}))}_{\hat{Δ} \text{ (rectifier)}},

(5)

where ${\hat{μ}}^{f}$ is the mean predicted value over the unlabeled data outside $s$ , and the rectifier $\hat{Δ}$ is an estimate of the model bias on the labeled data. The rectifier is a classical idea in model-assisted estimation. PPI is more efficient than a naive estimator when $| U ∖ s |$ is large relative to $n$ and when $f$ produces small residuals (Angelopoulos et al. 2023b).
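Equation (5) translates directly into code. The sketch below is a minimal illustration with made-up numbers; the function and argument names are assumptions.

```python
def ppi_mean(y_labeled, f_labeled, f_unlabeled):
    """PPI estimate of a mean: imputation term plus rectifier (Equation (5))."""
    imputation = sum(f_unlabeled) / len(f_unlabeled)
    rectifier = sum(y - f for y, f in zip(y_labeled, f_labeled)) / len(y_labeled)
    return imputation + rectifier


# Made-up numbers: the model under-predicts by 2 on average in the labeled set.
estimate = ppi_mean(y_labeled=[10.0, 14.0], f_labeled=[9.0, 11.0],
                    f_unlabeled=[8.0, 10.0, 12.0])
```

Here the mean prediction over the unlabeled data is 10 and the rectifier is 2, so the estimate shifts the imputed mean upward by the average error observed on the labeled data.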

PPI, PPD, and model-assisted estimators such as GREG share similar mathematical forms but differ in two key respects relevant to the survey design setting. First, GREG fits a linear model on the survey sample, while PPI, PPD, and MC place no restriction on the model class. Second, and more importantly, the use of auxiliary data differs: for PPD, GREG, and MC, the imputation term $\frac{1}{N} \sum_{i \in U} f (x_{i})$ is a sum over the entire finite population; when the model $f$ is pre-trained (as in PPD), this term is a design-based constant, whereas for sample-fitted models such as GREG it depends on the sample. In the original PPI framework, the auxiliary data is not assumed to cover the full population, so the imputation term is itself a random quantity. Adapting PPI to the finite-population survey setting, where a census of auxiliary data $\{x_{i}\}_{i \in U}$ is available, requires introducing probability weights for the complement of the sample $U ∖ s$, which is the subject of Section 3.

3. Extending PPI for Survey Use: Prediction-Powered Estimator (PPE)

The PPI estimator can be decomposed as ${\hat{μ}}_{ppi} = {\hat{μ}}^{f} + \hat{Δ}$ , where ${\hat{μ}}^{f}$ is the mean imputed value over $U ∖ s$ and $\hat{Δ} = \frac{1}{n} \sum_{i \in s} (y_{i} - f (x_{i}))$ is the rectifier. Adapting this to the survey setting requires introducing probability weights for both terms. The rectifier is handled directly by the HT estimator:

{\hat{μ}}_{HT}^{Δ} = \frac{1}{N} \sum_{i \in s} \frac{y_{i} - f (x_{i})}{π_{i}} .

For the imputation term, since PPI does not assume a census of auxiliary data, we must weight the units not selected into the sample. We abstract the survey design as inducing a distribution over partitions of $U$ into a sampled set $s$ of size $n$ and its complement $U ∖ s$ of size $N - n$ . This allows us to define inclusion probabilities for the complement.

3.1. Inclusion Probabilities for $U ∖ s$

Let $ϕ_{i} : = P (i \notin s) = 1 - π_{i}$ for all $i \in U$ . The second-order probability that both $i$ and $j$ are excluded from $s$ is:

ϕ_{ij} = P (i, j \notin s) = 1 - P (i \in s \text{ or } j \in s) = 1 - π_{i} - π_{j} + π_{ij} .

(6)

We impose the natural analogue of measurability: (U1) $π_{i} > 0$ and $π_{ij} > 0$ for all $i, j \in U$ , and (U2) $ϕ_{i} > 0$ and $ϕ_{ij} > 0$ for all $i, j \in U$ . Under these conditions, the HT estimator of $μ^{f}$ over $U ∖ s$ is:

{\hat{μ}}_{ϕ}^{f} = \frac{1}{N} \sum_{i \notin s} \frac{f (x_{i})}{ϕ_{i}},

(7)

which is unbiased for $\frac{1}{N} \sum_{i \in U} f (x_{i})$ .
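Both Equation (6) and the unbiasedness of Equation (7) can be checked exactly on a small population by enumerating every possible sample. The sketch below assumes SRSWOR with $N = 5$ and $n = 2$ and uses made-up prediction values; exact rational arithmetic avoids rounding error.

```python
from fractions import Fraction
from itertools import combinations

N, n = 5, 2
units = range(N)
samples = [set(s) for s in combinations(units, n)]   # all SRSWOR samples
p_sample = Fraction(1, len(samples))                 # each equally likely


def prob(event):
    """Design probability of an event defined on the realised sample."""
    return sum(p_sample for s in samples if event(s))


pi = [prob(lambda s, i=i: i in s) for i in units]
phi = [1 - pi_i for pi_i in pi]                      # phi_i = 1 - pi_i

# Equation (6) for one arbitrary pair of units.
a, b = 0, 3
pi_ab = prob(lambda s: a in s and b in s)
phi_ab = prob(lambda s: a not in s and b not in s)
identity_holds = phi_ab == 1 - pi[a] - pi[b] + pi_ab

# Equation (7): average the complement-weighted estimator over all samples.
f = [Fraction(v) for v in (2, 7, 1, 8, 3)]           # made-up predictions


def mu_phi(s):
    return sum(f[i] / phi[i] for i in units if i not in s) / N


expected_value = sum(p_sample * mu_phi(s) for s in samples)
population_mean_f = sum(f) / N
```

The design expectation of the complement-weighted estimator coincides with the population mean of the predictions, as claimed.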

3.2. The PPE Estimator and Its Variance

Combining the two terms, the Prediction-Powered Estimator (PPE) is:

{\hat{μ}}_{PPE} = {\hat{μ}}_{ϕ}^{f} + {\hat{μ}}_{HT}^{Δ} = \frac{1}{N} \sum_{i \notin s} \frac{f (x_{i})}{ϕ_{i}} + \frac{1}{N} \sum_{i \in s} \frac{y_{i} - f (x_{i})}{π_{i}} .

(8)

Under assumptions (U1)–(U2) and for any $f$ pre-trained independently of $s$ , PPE is unbiased: $E [{\hat{μ}}_{PPE}] = \bar{y}$ .
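The unbiasedness claim can be verified exactly for a toy population by averaging Equation (8) over all possible samples. The sketch assumes SRSWOR, and the response and prediction values are made up; note that the predictions are visibly imperfect, yet the design expectation still equals the true mean.

```python
from fractions import Fraction
from itertools import combinations

# Made-up finite population: responses y and black-box predictions f.
y = [Fraction(v) for v in (12, 15, 9, 20, 14, 11)]
f = [Fraction(v) for v in (10, 16, 9, 17, 15, 13)]
N, n = len(y), 2
pi = Fraction(n, N)        # SRSWOR inclusion probability
phi = 1 - pi               # exclusion probability


def ppe(sample):
    """Equation (8) evaluated for one realised sample."""
    imputation = sum(f[i] / phi for i in range(N) if i not in sample)
    rectifier = sum((y[i] - f[i]) / pi for i in sample)
    return (imputation + rectifier) / N


samples = [set(s) for s in combinations(range(N), n)]
estimates = [ppe(s) for s in samples]
design_expectation = sum(estimates) / len(samples)   # samples are equally likely
true_mean = sum(y) / N
```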

The variance of PPE decomposes as:

\begin{matrix} V [{\hat{μ}}_{PPE}] = V ({\hat{μ}}_{ϕ}^{f}) + V ({\hat{μ}}_{HT}^{Δ}) + COV ({\hat{μ}}_{ϕ}^{f}, {\hat{μ}}_{HT}^{Δ}) \\ = V ({\hat{μ}}_{ϕ}^{f}) + V ({\hat{μ}}_{HT}^{Δ}) + \frac{1}{N^{2}} \sum_{i \in U} \sum_{j \neq i \in U} (\frac{f (x_{i})}{ϕ_{i}}) (\frac{y_{j} - f (x_{j})}{π_{j}}) COV (1_{i \notin s}, 1_{j \in s}) \end{matrix}

The first two terms are standard HT variances. For the covariance term, note that:

\begin{matrix} COV (1_{i \notin s}, 1_{j \in s}) = E [1_{i \notin s} 1_{j \in s}] - E [1_{i \notin s}] E [1_{j \in s}] \\ = P (i \notin s, j \in s) - ϕ_{i} π_{j} \\ = (P (j \in s) - P (i, j \in s)) - ϕ_{i} π_{j} \\ = π_{j} - π_{ij} - (1 - π_{i}) π_{j} \\ = π_{i} π_{j} - π_{ij} \end{matrix}

which equals zero for designs with replacement (since $π_{ij} = π_{i} π_{j}$ ) and is close to zero for SRSWOR. An unbiased estimator of the covariance term is:

\hat{Cov} ({\hat{μ}}_{ϕ}^{f}, {\hat{μ}}_{HT}^{Δ}) = \frac{1}{N^{2}} \sum_{i \notin s} \sum_{j \in s} \frac{f (x_{i})}{ϕ_{i}} \cdot \frac{y_{j} - f (x_{j})}{π_{j}} \cdot \frac{π_{i} π_{j} - π_{ij}}{π_{j} - π_{ij}} .

(9)

For SRSWOR this simplifies to $\frac{1}{N} {\bar{f}}_{U ∖ s} {\bar{e}}_{s}$ , where ${\bar{f}}_{U ∖ s}$ is the mean prediction over the excluded units and ${\bar{e}}_{s}$ is the mean residual over the sample. The combined unbiased variance estimator for PPE is:

\hat{V} [{\hat{μ}}_{PPE}] = \hat{V} ({\hat{μ}}_{ϕ}^{f}) + \hat{V} ({\hat{μ}}_{HT}^{Δ}) + \hat{Cov} ({\hat{μ}}_{ϕ}^{f}, {\hat{μ}}_{HT}^{Δ}),

(10)

where each component is estimated by its standard unbiased HT-type estimator. It holds that $E [\hat{V} ({\hat{μ}}_{PPE})] = V [{\hat{μ}}_{PPE}]$ .
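The SRSWOR simplification of the covariance estimator in Equation (9) can be checked numerically. The sketch below evaluates the general double sum and the closed form $\frac{1}{N} {\bar{f}}_{U ∖ s} {\bar{e}}_{s}$ side by side in exact arithmetic; the function names and the toy values are assumptions.

```python
from fractions import Fraction


def srswor_probs(N, n):
    """First- and second-order inclusion probabilities under SRSWOR."""
    return Fraction(n, N), Fraction(n * (n - 1), N * (N - 1))


def cov_hat_general(f_out, e_in, N):
    """Equation (9): f_out are predictions for excluded units,
    e_in are residuals y - f(x) for sampled units."""
    n = len(e_in)
    pi, pi_ij = srswor_probs(N, n)
    phi = 1 - pi
    weight = (pi * pi - pi_ij) / (pi - pi_ij)
    total = sum((f / phi) * (e / pi) * weight for f in f_out for e in e_in)
    return total / N ** 2


def cov_hat_srswor(f_out, e_in, N):
    """Closed form under SRSWOR: mean prediction times mean residual, over N."""
    fbar = Fraction(sum(f_out), len(f_out))
    ebar = Fraction(sum(e_in), len(e_in))
    return fbar * ebar / N


# Toy case (made-up integers): N = 7, n = 3 sampled, 4 excluded.
N = 7
e_in = [1, -2, 4]
f_out = [5, 6, 7, 8]
pi, pi_ij = srswor_probs(N, len(e_in))
weight = (pi * pi - pi_ij) / (pi - pi_ij)
```

The weight $\frac{π_{i} π_{j} - π_{ij}}{π_{j} - π_{ij}}$ collapses to $1 / N$ under SRSWOR, which is what produces the closed form.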

Note that when $| U ∖ s |$ is large relative to $n$ , ${\hat{μ}}_{ϕ}^{f} \approx \frac{1}{N} \sum_{i \in U} f (x_{i})$ , so that PPE approximates PPD. The PPE framework nonetheless offers a principled design-based extension of PPI that is exact in finite samples, with the covariance term providing a potential efficiency gain when $f$ systematically over- or under-predicts on $U ∖ s$ .

Remark. A third estimator, ${GREG}_{\hat{y}}$, uses predictions ${\hat{y}}_{i} = f (x_{i})$ as covariates in a standard GREG estimator (Equation (1)). Unlike MC estimation, $f$ is pre-trained and not constrained to smooth function classes; unlike PPD and PPE, the GREG regression step provides additional robustness to systematic prediction bias. ${GREG}_{\hat{y}}$ is asymptotically design-unbiased. Together, PPE, PPD, and ${GREG}_{\hat{y}}$ cover a range of bias-variance trade-offs: PPD and PPE are exactly unbiased and rely directly on model quality, while ${GREG}_{\hat{y}}$ offers resilience when predictions are informative but systematically shifted. A further option is to use pre-trained predictions $f (x_{i})$ as the calibration auxiliary variable in the MC framework of Wu and Sitter (2001), enforcing agreement between the sample-weighted and population totals of predictions; we term this the Prediction-Powered Model-Calibration Estimator (PPMCE). This calibration-based refinement of PPD is worth considering when additional robustness to systematic prediction bias is desired. A case for pre-trained models over in-sample training arises whenever the variance increase due to concept drift is less than the variance increase due to model training on the typically small survey sample.
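A minimal sketch of ${GREG}_{\hat{y}}$ under SRSWOR follows. It assumes the working model includes an intercept alongside the single covariate ${\hat{y}}_{i}$, so that the fit is ordinary least squares of $y$ on $\hat{y}$ within the sample; the function name and the toy data are illustrative assumptions.

```python
def greg_yhat_srswor(y_s, f_s, f_U):
    """GREG with the pre-trained prediction as the only covariate (SRSWOR).

    y_s : observed responses for the sampled units
    f_s : pre-trained predictions for the sampled units
    f_U : pre-trained predictions for every unit in the population
    """
    n, N = len(y_s), len(f_U)
    fbar_s, ybar_s = sum(f_s) / n, sum(y_s) / n
    sxx = sum((f - fbar_s) ** 2 for f in f_s)
    sxy = sum((f - fbar_s) * (y - ybar_s) for f, y in zip(f_s, y_s))
    b1 = sxy / sxx                       # slope (equal weights under SRSWOR)
    b0 = ybar_s - b1 * fbar_s            # intercept (assumed part of the model)
    fitted_U = [b0 + b1 * f for f in f_U]
    resid_s = [y - (b0 + b1 * f) for y, f in zip(y_s, f_s)]
    # Equation (1): population mean of fitted values + HT mean of residuals.
    estimate = sum(fitted_U) / N + sum(resid_s) / n
    return estimate, (b0, b1)


# Toy data (made up): predictions are systematically shifted, y = 2 * f + 3.
f_population = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f_sample = [1.0, 3.0, 6.0]
y_sample = [2 * f + 3 for f in f_sample]
greg_est, (intercept, slope) = greg_yhat_srswor(y_sample, f_sample, f_population)
```

In this example the regression step absorbs the systematic shift entirely, which illustrates the robustness property described in the remark.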

4. LLM Use-Cases

Next, we aim to demonstrate the utility of using pre-trained models in the survey sampling setting by considering two use-cases where LLMs can be leveraged. We select two applications where LLMs can perform extremely well: optical character recognition (OCR), which involves parsing desired text from an image, and coding using harmonized systems, such as the North American Product Classification System (NAPCS). In such scenarios, it is infeasible to utilize model-assisted estimation procedures without a pre-trained model, as the amount of data required to fit such a model far exceeds what is collected in a given survey cycle. Note that no further fine-tuning is performed: we use available LLMs as-is and treat them as pre-trained models.

Specifically, we design two sets of experiments. The first use-case simulates estimating a population salary statistic (mean salary). In this setting, we simulate small survey samples and have access to auxiliary data in the form of images of T4 tax forms for the entire population; the broader goal is to illustrate any application that can make use of OCR on image data, such as PDFs. The second use-case simulates a Statistics Canada survey that aims to estimate a given population’s grocery spending statistics, for instance to obtain statistical estimates related to access to fresh and processed foods in a remote location. This use-case simulates a small survey paired with auxiliary data in the form of retail scanner data from grocery retailers over a given time period. Estimates are produced and reported at the NAPCS group, class, and subclass hierarchy levels. This scenario is highly relevant, as many statistical organizations devote considerable effort and cost to manually coding data using harmonized systems such as NAPCS and the North American Industry Classification System (NAICS).

For both sets of experiments, the data used to train the models (Gemini 2.0 Flash, GPT-4o-mini) are not publicly known. For Section 4.2, the data are artificially generated, so the LLM cannot have been trained on this dataset. While it is not possible to formally distinguish between covariate shift and concept drift in this setting, the document structure and visual layout differ from standard OCR benchmarks, suggesting a form of distributional mismatch relative to the LLM’s pretraining data.

For Section 4.3, the dataset consists of a cleaned subset of a tabular Walmart grocery product dataset. The LLM is used to map short retail product descriptions to NAPCS codes, a task-specific classification problem that reflects the operational context of statistical production rather than general-purpose text understanding. Although similar product descriptions may appear in publicly available corpora, the joint distribution of inputs and target labels induced by this classification task is unlikely to align with the LLM’s pretraining objective. We therefore view this setting as exhibiting a form of task-level distributional shift, consistent with concept drift.

4.1. LLM API Details

Both use-cases rely on commercial LLM APIs. For the T4 salary estimation task (Section 4.2), we use Gemini 2.0 Flash via Google’s Generative AI API (Google 2025). Each API call submits one image alongside a short text prompt and receives a structured JSON response via Pydantic schema enforcement (Pydantic 2025), eliminating the need for post-hoc parsing. At the time of experimentation (early 2025), Gemini 2.0 Flash was priced at approximately USD $0.10 per 1,000 image input tokens; processing the full population of $N = 10{,}000$ T4 images required approximately 10 million tokens and cost roughly USD $100, with a total wall-clock time of approximately three hours using parallelized API calls (batch size 50). For the NAPCS coding task (Section 4.3), we use GPT-4o-mini via OpenAI’s API (OpenAI 2024), which was selected for its cost-efficiency on high-volume classification tasks (approximately USD $0.15 per million output tokens). Processing the Walmart dataset ($\approx 30{,}000$ products) through the multi-stage coding pipeline took approximately four hours and cost roughly USD $15. These one-time inference costs scale linearly with population size and are incurred once per survey cycle, independently of the number of Monte Carlo repetitions used in the simulation study.

4.2. Salary Estimation with T4 Forms

4.2.1. Methodology

A total of 10,000 simulated T4 forms were generated. An automated script was built to generate a wide variety of T4 forms rather than relying on a single template, in order to demonstrate the LLM’s ability to generalize and the applicability of the approach to any use-case where OCR from a diverse range of documents may be of use. First, data for a population unit were generated, including the individual’s name, a company name, a province, and salary-related information. The T4 box numbers and titles were taken from the T4 template provided by the Government of Canada (2025c). Several aspects of the data and form generation process were randomized: whether the form was bilingual, the font type and size, text coloring, background color, the inclusion of a simple company logo, the layout structure (number of columns and rows), the use of bounding boxes, small rotations, the use of a watermark, image modifications to simulate paper textures, and the ordering and organization of the T4 boxes/fields. Each generated image was associated with its ground-truth salary data. The gross salaries were randomly sampled first, and all related values were then randomly generated conditional on the gross salary.

A total of 1,000 Monte Carlo repetitions of a survey were conducted using simple random sampling without replacement (SRSWOR), with a population of size $N = 10{,}000$ and sample sizes $n \in \{10, 20, 50, 100, 1000, 2000\}$. The HT estimator was compared to the three proposed methods of using pre-trained models. Due to the complexity of the task, no other model-assisted estimators (e.g., GREG) were used as baselines: as noted above, fitting a model to raw images from a small survey sample is infeasible.
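The structure of such a Monte Carlo comparison can be sketched as follows. This is an illustration only: the population is synthetic Gaussian data rather than the T4 images, the predictor is a noisy copy of the response rather than an LLM, only HT and PPD are compared, and the metric definitions (relative bias as a percentage of the true mean, MAE and MSE as percentages of the HT values) are assumptions about how the reported metrics are computed.

```python
import random


def monte_carlo(N=2000, n=50, reps=300, seed=1):
    rng = random.Random(seed)
    # Synthetic population (assumption): y plays the role of salary and f the
    # role of an imperfect black-box prediction of it.
    y = [rng.gauss(60000, 15000) for _ in range(N)]
    f = [yi + rng.gauss(500, 3000) for yi in y]
    truth = sum(y) / N
    f_mean_U = sum(f) / N

    ht, ppd = [], []
    for _ in range(reps):
        s = rng.sample(range(N), n)                       # SRSWOR draw
        ht.append(sum(y[i] for i in s) / n)
        ppd.append(f_mean_U + sum(y[i] - f[i] for i in s) / n)

    def summarize(est):
        bias = sum(est) / reps - truth
        mae = sum(abs(e - truth) for e in est) / reps
        mse = sum((e - truth) ** 2 for e in est) / reps
        return bias, mae, mse

    ht_bias, ht_mae, ht_mse = summarize(ht)
    ppd_bias, ppd_mae, ppd_mse = summarize(ppd)
    return {
        "RB_HT_%": 100 * ht_bias / truth,
        "RB_PPD_%": 100 * ppd_bias / truth,
        "RelMAE_PPD_%": 100 * ppd_mae / ht_mae,
        "RelMSE_PPD_%": 100 * ppd_mse / ht_mse,
    }


results = monte_carlo()
```

With prediction errors whose standard deviation is one fifth that of the response, the relative MSE of PPD in this toy setting should be on the order of a few percent of the HT value.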

Point predictions from the LLM were obtained as follows. We used the Google API (Google 2025) to call Gemini 2.0 Flash (Google DeepMind 2025), which was prompted with a simple text instruction and an image of a T4 form (e.g., see Figure 1). We enforced a structured response by providing a Pydantic response schema (Pydantic 2025) via the API’s structured response capabilities, ensuring that the response from Gemini 2.0 Flash could be parsed and that the returned value was simply the desired numerical value (float), without any additional text, characters, or explanation.

Figure 1.

Example of a synthetic T4 tax statement image. Images are generated randomly, with noise and a variety of formats and styles.

4.2.2. Results

We report the relative bias (RB), relative MAE (RelMAE), and relative MSE (RelMSE), as found in Table 1. The results clearly demonstrate the value of using pre-trained models, in the form of LLMs, for this particular use-case, as compared to the HT estimator. Specifically, all three approaches resulted in a mean reduction in MAE between 72.2% and 75.8%, and a mean reduction in MSE between 91.5% and 93.8%, relative to the HT estimator. Of the three proposed approaches, PPE performed slightly behind the others, while ${GREG}_{\hat{y}}$ slightly out-performed PPD.

Table 1.

Relative Bias, MAE, and MSE of HT, ${GREG}_{\hat{y}}$ , PPE, and PPD Estimators on Estimates Using Synthetic T4 Dataset. Results Are Averages Across the 1,000 Simulation Repeats.

		Sample size (n)
Metrics (%)	Method	10	20	50	100	1,000	2,000	Average
RB	HT	0.621	0.426	0.068	0.065	0.058	0.0	0.214
	${GREG}_{\hat{y}}$	0.0	0.0	0.0	0.0	0.0	0.0	0.0
	PPE	0.0	0.0	0.0	0.0	0.0	0.0	0.0
	PPD	0.0	0.0	0.0	0.0	0.0	0.0	0.0
RelMAE	HT	100.000	100.000	100.000	100.000	100.000	100.000	100.000
	${GREG}_{\hat{y}}$	23.956	23.5	23.7	24.4	25.1	24.6	24.2
	PPE	23.477	24.263	24.345	24.591	30.324	39.652	27.776
	PPD	23.4	24.212	24.226	24.409	25.603	25.173	24.512
RelMSE	HT	100.000	100.000	100.000	100.000	100.000	100.000	100.000
	${GREG}_{\hat{y}}$	6.771	6.3	6.0	5.9	6.2	6.2	6.2
	PPE	6.480	6.633	6.308	6.058	9.111	16.163	8.459
	PPD	6.5	6.607	6.250	5.955	6.528	6.500	6.384

Note. Bold indicates best performing method.

When considering metrics associated with efficiency (see Table 2), the results were fairly consistent. Although the HT estimator had the highest overall mean coverage (94.4%), the three proposed pre-trained model-assisted estimators performed similarly, with coverage ranging from 92.3% to 94.1% (equivalently, relative coverage of 97.7%–99.7%). Despite achieving a similar coverage rate to the HT estimator, the proposed approaches did so much more efficiently, with a reduction in CI width ranging from 74.3% to 77.7% relative to the HT estimator. The relative MSE of the variance estimators, Rel. MSE $(\hat{V})$, for both PPD and PPE was consistent with the simulation results (see Supplemental Material). Specifically, PPD and PPE yielded consistently lower Rel. MSE $(\hat{V})$ than HT, with values in the range of $\approx 14\text{–}25\%$.

Table 2.

Relative Efficiency Metrics of HT, ${GREG}_{\hat{y}}$ , PPE, and PPD Estimators on Synthetic T4 Dataset. Results Are Averages Across 1,000 Simulation Repeats.

		Sample size (n)
Metrics (%)	Method	10	20	50	100	1,000	2,000	Average
Coverage	HT	91.300	94.7	95.000	94.100	95.6	95.8	94.4
	${GREG}_{\hat{y}}$	86.300	90.500	94.600	94.300	94.100	93.800	92.267
	PPE	91.9	94.200	95.100	94.900	91.800	91.900	93.300
	PPD	91.9	94.300	95.2	95.1	94.400	93.700	94.100
Rel. CI width	HT	100.000	100.000	100.000	100.000	100.000	100.000	100.000
	${GREG}_{\hat{y}}$	20.2	21.1	22.3	23.1	23.4	23.5	22.3
	PPE	22.303	22.388	23.140	23.909	26.823	35.633	25.699
	PPD	22.302	22.384	23.124	23.863	23.994	24.089	23.293
Rel. MSE $(\hat{V})$	HT	100.000	100.000	100.000	100.000	100.000	100.000	100.000
	PPE	13.855	14.068	18.584	19.425	18.611	24.818	18.227
	PPD	13.833	14.025	18.450	19.136	15.395	15.915	16.126

Note. Bold indicates best performing method.

We perform HT-equivalent sample size experiments to highlight potential real-world efficiency gains, where the HT-equivalent sample size is defined as the smallest sample size that a given model-assisted estimator requires to achieve the same estimation accuracy as the HT estimator at a given reference sample size. Compared to the HT estimator with a sample size of $n = 2000$, the proposed estimators PPE/PPD/${GREG}_{\hat{y}}$ each resulted in point estimates with a reduction in MAE (55%/62%/63%) over HT, with CI width reductions of 60%/64%/65%, all while using half the sample size ($n = 1000$). These results demonstrate the potential benefits of using pre-trained frontier LLMs for model-assisted estimation and, in doing so, open the door to novel auxiliary data sources, such as image data, that are not typically leveraged by model-assisted estimation techniques. The LLM-based approach resulted in higher-quality estimates with greater efficiency than the baseline HT estimator, and the HT-equivalent sample size results demonstrate the potential for partial reductions in survey sample size and response burden.

4.3. NAPCS Grocery Dataset

In these experiments, the aim is to simulate a setting where the goal is to produce household grocery spending estimates for the product categories of a given NAPCS code level. The publicly available Walmart Grocery Product Dataset (Kaggle 2025) was used and adapted for these studies. This dataset comprises information on products sold by US Walmart grocery departments. The data include the grocery department (e.g., deli), the category (e.g., Hummus, Dips, and Salsa), the product name (e.g., Marketside Roasted Red Pepper Hummus, 10 oz), and the price in USD. We treat the dataset as a proxy for a retail scanner dataset, and treat the product names and prices as retail scanner data.

4.3.1. Methodology

The Walmart Grocery Product Dataset (Kaggle 2025) was processed to remove unnecessary columns and duplicate data. The dataset does not come with any ground truth NAPCS codes, so we implement an LLM coder, organized as a labeling pipeline, to provide target labels for these experiments. Briefly, we use Google’s API to prompt Gemini 2.0 Flash with instructions on the task, provide it with a list and description of 17 pre-filtered NAPCS-3 level (group) codes associated with food and beverages, and have the LLM return the top-3 most likely NAPCS-3 level codes for the input product data. We call these the predicted groups. Next, we retrieve all the NAPCS-5 level (class) codes associated with the predicted groups, which range from 1 to 16 classes per group. Again, we provide the LLM with instructions on the task and the list of potential classes (with descriptions), and query the LLM to respond with the top-5 most likely classes. Next, we collect all the NAPCS-6 level (subclass) codes associated with the predicted classes, which range from 1 to 51 subclasses per class. We provide the LLM with instructions on the task and the list of potential subclasses (with descriptions), and query the LLM to respond with the single most likely subclass. We use Pydantic (Pydantic 2025) and structured responses to ensure conformity with desired outputs. We treat these LLM-provided codes as the target ground truth NAPCS codes for the remainder of the experiments. We do not assert that these are the true NAPCS codes for the data, but assume they are a noisy proxy that is useful for research purposes.
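The coarse-to-fine structure of the labeling pipeline can be sketched as follows. This is not the pipeline used in the paper: the `rank_candidates` function is a hypothetical stand-in for an LLM call that simply counts shared words, and the hierarchy, codes, and descriptions are made up rather than real NAPCS entries. Only the narrowing logic (top groups, then top classes, then a single subclass) mirrors the description above.

```python
def rank_candidates(description, candidates, k):
    """Hypothetical stand-in for one LLM call: rank candidate codes by the
    number of words their description shares with the product description."""
    words = set(description.lower().split())
    ranked = sorted(
        candidates.items(),
        key=lambda item: (-len(words & set(item[1].lower().split())), item[0]),
    )
    return [code for code, _ in ranked[:k]]


# Made-up three-level hierarchy (group -> class -> subclass).
GROUPS = {"G1": "dairy milk cheese yogurt",
          "G2": "fresh fruit vegetables produce",
          "G3": "bread bakery pastry"}
CLASSES = {"G1": {"G1-A": "milk cream", "G1-B": "cheese"},
           "G2": {"G2-A": "fresh fruit", "G2-B": "fresh vegetables"},
           "G3": {"G3-A": "bread rolls"}}
SUBCLASSES = {"G1-A": {"G1-A-1": "whole milk", "G1-A-2": "cream"},
              "G1-B": {"G1-B-1": "cheddar cheese"},
              "G2-A": {"G2-A-1": "apples", "G2-A-2": "bananas"},
              "G2-B": {"G2-B-1": "carrots"},
              "G3-A": {"G3-A-1": "sliced bread"}}


def code_product(description, top_groups=3, top_classes=5):
    """Narrow from groups to classes to one subclass, as in the text."""
    groups = rank_candidates(description, GROUPS, top_groups)
    class_pool = {c: d for g in groups for c, d in CLASSES[g].items()}
    classes = rank_candidates(description, class_pool, top_classes)
    subclass_pool = {c: d for cl in classes for c, d in SUBCLASSES[cl].items()}
    return rank_candidates(description, subclass_pool, 1)[0]


# Smaller shortlists than in the text so the toy hierarchy is actually pruned.
milk_code = code_product("Great Value Whole Milk 1 gal", top_groups=2, top_classes=2)
banana_code = code_product("Fresh bananas bunch", top_groups=2, top_classes=2)
```

Retaining several candidates at the upper levels means an early mistake at the group level does not necessarily rule out the correct subclass.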

To acquire point-predictions from a pre-trained model, we use an approach similar to that described for the LLM coder, but with GPT-4o-mini from OpenAI (OpenAI 2024). In contrast to the LLM coder, after receiving the class-level predictions and filtering the feasible set of subclasses, we query the LLM to respond with the top-10 most likely NAPCS-6 level (subclass) codes. With these top-10 codes, we prompt the LLM one last time, framing the task as a multiple choice question in which each of the 10 candidate subclasses is assigned to one of the options A-J. We prompt the LLM to respond with a single token, A through J, and receive the log-probabilities of the response from the API, which we convert to probability scores. We assign a probability score of 0 to all subclasses outside this top-10. Once the dataset has been processed, we perform a calibration procedure in which we sample 10% (≈ 3,000) of the data and use isotonic regression (Niculescu-Mizil and Caruana 2005; scikit-learn 2025b) to calibrate the probabilities across all subclasses. At this stage, we can associate with each input instance a probability vector over NAPCS-6 subclasses, which is then multiplied by the product price. One can also view this approach as a set of C point-prediction models, one for each of the C NAPCS-6 subclasses.
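As a rough illustration of the last two steps, the sketch below turns per-option log-probabilities into a price-weighted vector over subclasses. The function name, the dictionary layout, and the example codes are assumptions made for illustration; the isotonic calibration step is omitted, and the log-probabilities are simply exponentiated without renormalisation, which may differ from the authors' implementation.

```python
import math


def expenditure_vector(option_logprobs, option_to_subclass, all_subclasses, price):
    """Spread one product's price over subclasses using option probabilities.

    option_logprobs: {"A": logprob, ...} for the multiple choice options returned.
    option_to_subclass: {"A": subclass_code, ...} mapping options to subclasses.
    all_subclasses: every subclass code considered in the study.
    """
    # Subclasses that were not offered as an option keep a score of 0.
    scores = {code: 0.0 for code in all_subclasses}
    for option, logprob in option_logprobs.items():
        scores[option_to_subclass[option]] = math.exp(logprob)
    # Point-prediction per subclass: probability score times product price.
    return {code: p * price for code, p in scores.items()}


# Hypothetical example with two options and three subclasses.
vec = expenditure_vector(
    option_logprobs={"A": math.log(0.7), "B": math.log(0.2)},
    option_to_subclass={"A": "S1", "B": "S2"},
    all_subclasses=["S1", "S2", "S3"],
    price=4.0,
)
```

Summing such vectors over the items on a receipt gives, for each subclass, a predicted expenditure that can serve as the model prediction for that unit.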

We conducted 1,000 Monte Carlo repeats of a survey under SRSWOR, using a population of size $N = 100{,}000$ and sample sizes of $n \in \{10, 20, 50, 100, 1000, 2000\}$. For each unit in the population, a “scanner data grocery receipt” was generated by first sampling the number of food items purchased uniformly from $U[15, 30]$, then sampling that number of rows uniformly from the Walmart dataset (with associated LLM-coder NAPCS labels). The resulting set of items represents the weekly grocery purchase of one individual in the population. This simple setup is meant to simulate a survey seeking information on the weekly grocery expenditure of a given population, with estimates by a particular NAPCS level.
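A minimal sketch of this simulation design is given below, on a much smaller scale. The four-row catalog, the population and sample sizes, and the helper names are hypothetical, and rows are drawn with replacement, which is an assumption since the text only says rows were sampled uniformly. The HT estimator of a total under SRSWOR reduces to the expansion estimator $\frac{N}{n} \sum_{k \in s} y_k$.

```python
import random


def make_receipt(catalog, rng):
    """One unit's weekly purchase: 15 to 30 rows drawn uniformly from the catalog."""
    n_items = rng.randint(15, 30)  # randint is inclusive on both ends
    return [rng.choice(catalog) for _ in range(n_items)]


def spending(receipt, code):
    """Total amount spent on items carrying the given category code."""
    return sum(price for item_code, price in receipt if item_code == code)


def ht_total_srswor(sample_values, N):
    """HT estimator of a population total under SRSWOR (inclusion prob. n/N)."""
    return N / len(sample_values) * sum(sample_values)


rng = random.Random(0)
catalog = [("A", 2.0), ("A", 3.5), ("B", 1.25), ("C", 6.0)]  # hypothetical (code, price)
N, n = 1000, 100

population = [make_receipt(catalog, rng) for _ in range(N)]
y = [spending(receipt, "A") for receipt in population]

sample_idx = rng.sample(range(N), n)  # simple random sampling without replacement
estimate = ht_total_srswor([y[k] for k in sample_idx], N)
true_total = sum(y)
```

Repeating the last three lines many times and comparing `estimate` with `true_total` gives the Monte Carlo distribution of the estimator for one category and one sample size.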

The HT estimator was compared to the PPE, PPD, and ${GREG}_{\hat{y}}$ approaches. Due to the complexity of the task, no other model-assisted estimators (e.g., GREG) were used as baselines, as such approaches would be expected to fail. In this experiment we also include the coefficient of variation (CV), as well as the mean coverage error: $\frac{1}{B} \sum_{m = 1}^{B} \frac{1}{n} \sum_{i = 1}^{n} \mathbf{1}\{\theta \notin \mathrm{CI}_{m}\} \min \left( | \theta - \mathrm{UB}_{m} |, | \mathrm{LB}_{m} - \theta | \right)$. In words, we compute the average absolute distance of the true population parameter to the closest CI endpoint (UB: upper-bound, LB: lower-bound), where intervals that contain the parameter contribute zero. This metric is more granular than the mean coverage rate, as coverage of a single interval is binary.
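The coverage rate and the mean coverage error can be computed together from the replicate intervals, as in the sketch below. The function name and the example intervals are hypothetical. Because the summand in the formula above does not depend on the inner index $i$, the sketch averages over the $B$ replicates only; this is one reading of the formula rather than the authors' code.

```python
def coverage_metrics(intervals, theta):
    """Return (coverage rate, mean coverage error) over replicate intervals.

    intervals: list of (lower_bound, upper_bound) pairs, one per replicate.
    theta: the true population parameter.
    """
    covered = 0
    miss_distance = 0.0
    for lb, ub in intervals:
        if lb <= theta <= ub:
            covered += 1  # a covering interval contributes zero error
        else:
            # Distance from theta to the nearest endpoint of a missing interval.
            miss_distance += min(abs(theta - ub), abs(lb - theta))
    B = len(intervals)
    return covered / B, miss_distance / B


# Hypothetical example: two of four intervals miss theta = 10.
rate, error = coverage_metrics(
    [(8.0, 12.0), (10.5, 13.0), (6.0, 9.8), (9.0, 11.0)], theta=10.0
)
```

Two estimators with the same coverage rate can have very different coverage errors: an interval that misses by 0.2 and one that misses by 20 count equally toward the rate but not toward the error.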

4.3.2. Result

Aggregated across all NAPCS-6 subclasses, sample sizes, and experimental repeats, PPD and PPE consistently outperform HT across all metrics considered. Both estimators achieved higher mean and median coverage than HT, and for nearly every other metric computed—MAE, MSE, CI width, CV, and coverage error—PPD and PPE yielded approximately a $2 \times$ improvement in mean or median values relative to HT. PPD and PPE also exhibited smaller inter-quartile ranges across metrics, indicating greater robustness across NAPCS codes and sample sizes. The ${GREG}_{\hat{y}}$ estimator reduced mean MAE and coverage error relative to HT, but consistently underperformed PPD and PPE on coverage, achieving the lowest mean coverage of all estimators. Although ${GREG}_{\hat{y}}$ produced the smallest CI widths and CVs, it did so at the cost of systematic undercoverage and showed wider variability across estimates, reflecting less robustness on this task. Detailed results by NAPCS-5 class at $n = 1000$ are given in Table 3.

Table 3.

Summary of Metrics for NAPCS Codes at Sample Size $n = 1000$ (HT/PPD/ ${GREG}_{\hat{y}}$ ).

NAPCS	Coverage (%)	CE ( $\times 10^{- 3}$ )	MAE ( $\times 10^{- 1}$ )	CI width ( $\times 10^{- 1}$ )	CV (%)
11411	93.0/94.6/94.4	2.2/1.4/1.5	0.6/0.5/0.5	2.9/2.3/2.2	14.4/11.3/11.2
11421	91.6/90.6/89.3	1.0/0.6/0.7	0.2/0.1/0.1	0.9/0.4/0.4	34.4/14.7/14.5
11422	92.1/94.3/94.0	1.7/0.5/0.5	0.5/0.2/0.2	2.1/1.1/1.1	10.7/5.5/5.5
11431	94.1/94.9/94.3	1.2/0.5/0.5	0.4/0.2/0.2	1.8/0.8/0.8	11.2/5.3/5.3
11513	93.2/93.3/92.8	2.6/2.6/2.6	0.6/0.6/0.6	3.0/3.0/3.0	20.0/20.0/20.0
11611	93.8/93.5/93.6	0.9/0.9/1.0	0.3/0.3/0.3	1.3/1.3/1.3	22.9/22.9/22.8
12111	75.1/75.1/0.0	4.7/4.7/21.8	0.2/0.2/0.2	0.9/0.9/0.0	71.0/71.0/0.0
17111	93.8/94.8/94.6	5.4/0.9/0.9	1.8/0.4/0.4	8.6/2.1/2.0	8.6/2.1/2.0
17211	94.7/95.1/95.8	4.7/0.9/0.7	1.6/0.4/0.4	7.5/2.0/2.0	11.6/3.2/3.1
17212	94.3/94.3/95.5	2.6/0.8/0.7	1.0/0.3/0.3	4.7/1.5/1.5	14.6/4.6/4.5
17213	95.0/94.3/94.6	2.7/1.8/1.8	1.3/0.6/0.6	6.5/3.1/3.1	9.5/4.5/4.5
17215	93.9/94.5/94.6	6.4/2.8/2.9	2.2/1.0/1.0	10.3/5.2/5.1	4.4/2.2/2.2
17311	95.1/95.1/94.7	2.9/0.7/0.8	1.0/0.4/0.4	4.8/2.0/1.9	5.1/2.1/2.1
17312	93.7/92.9/92.5	3.2/2.9/3.1	0.9/0.7/0.7	4.3/3.3/3.3	9.2/7.0/7.1
17313	93.9/94.9/95.1	3.0/0.7/0.6	1.2/0.3/0.3	6.1/1.5/1.5	5.5/1.4/1.3
18212	93.1/91.1/91.7	1.3/1.0/0.9	0.3/0.2/0.2	1.7/0.9/0.9	20.5/11.0/10.9
18213	94.7/94.0/93.1	2.8/2.4/3.0	1.0/0.8/0.8	4.8/3.9/3.9	13.6/11.1/11.0
18311	94.7/95.9/95.1	1.8/0.6/0.6	0.7/0.3/0.3	3.4/1.6/1.6	8.8/4.2/4.2
18312	95.5/94.9/94.9	1.7/1.2/1.2	0.8/0.5/0.5	4.2/2.7/2.7	5.3/3.4/3.4
18314	94.6/96.1/95.8	4.4/0.9/0.9	1.8/0.6/0.6	8.6/3.2/3.2	5.4/2.0/2.0
18331	94.0/96.1/95.8	5.6/2.2/2.2	1.9/1.1/1.1	9.2/5.2/5.2	4.2/2.4/2.4
18341	93.8/93.8/93.5	5.7/4.5/4.7	1.6/1.4/1.4	8.2/6.9/6.7	11.3/9.5/9.3
18342	94.5/94.3/94.3	5.0/2.2/2.2	1.9/0.9/0.9	8.9/4.3/4.3	3.0/1.5/1.5
18351	92.9/96.2/95.1	5.4/0.9/1.2	1.7/0.7/0.7	8.0/3.7/3.7	4.4/2.1/2.1
18352	93.8/95.5/95.3	5.6/2.5/2.4	2.1/0.9/0.9	10.4/4.6/4.6	3.6/1.6/1.6
19211	95.7/94.9/94.7	1.8/0.7/0.7	0.9/0.4/0.4	4.5/1.8/1.8	6.1/2.5/2.5
21111	95.7/95.0/94.0	4.2/2.4/2.3	2.5/0.7/0.7	12.1/3.6/3.3	6.9/2.0/1.9
21112	94.9/94.8/93.5	9.9/1.5/1.9	4.0/0.6/0.5	20.1/2.7/2.2	6.0/0.8/0.6
Mean	93.4/93.7/90.8	3.6/1.6/2.3	1.2/0.6/0.5	6.1/2.7/2.6	12.6/8.3/5.7

Note. CE = mean coverage error (average distance of the true parameter to the nearest CI endpoint when the CI does not contain the truth, $\times 10^{- 3}$ ). Higher coverage and lower CE are both improvements; the two metrics are not in conflict since CE measures miss magnitude conditional on missing. Bold indicates best performing method.

HT-equivalent sample size experiments were performed, where again the HT-equivalent sample size is defined as the smallest sample size that a given model-assisted estimator requires to achieve the same estimation accuracy as the HT estimator at a given reference sample size. Figure 2 shows the HT-equivalent sample size of NAPCS-5 class-level estimates, reported relative to the HT estimator with sample size $n = 1000$. PPD and PPE perform almost identically. For PPE and PPD, 27 out of 39 ($> 69\%$) of the NAPCS-5 estimates correspond to HT-equivalent sample sizes of 500 or less, while ${GREG}_{\hat{y}}$ does so for 28 out of 39 ($> 71\%$) estimates. For 2/39 and 1/39 of the NAPCS-5 estimates, PPE/PPD and ${GREG}_{\hat{y}}$, respectively, yield HT-equivalent sample sizes as low as 10, corresponding to a $99\%$ reduction in the HT sample size required to achieve comparable estimation performance. Finally, for each NAPCS-5 code, we plot several performance measures evaluated at the corresponding HT-equivalent sample size and compare them to the HT estimator at $n = 1000$.

Figure 2.

HT-equivalent sample size experiments for PPE, PPD, and ${GREG}_{\hat{y}}$ with respect to HT@ $n$ = 1,000, at the NAPCS-5 (class) level. The HT-equivalent sample size (top left) is defined as the smallest sample size that a given estimator requires to achieve the same estimation accuracy as HT@ $n$ = 1,000. The remaining subplots show performance measures—relative MAE (top right), relative CI width (bottom left), and mean coverage error (bottom right)—plotted across sample sizes for PPE, PPD, ${GREG}_{\hat{y}}$ , and HT, evaluated at the corresponding HT-equivalent sample size and compared to HT@ $n$ = 1,000.

Across the NAPCS-5 classes, the estimates produced by PPE and PPD are more precise and more efficient than the HT estimator at $n = 1000$, even when using smaller sample sizes, in 27/39 of the estimates. Though ${GREG}_{\hat{y}}$ produced similar trends and results for the MAE and CI width metrics, for 11/39 NAPCS-5 level estimates ${GREG}_{\hat{y}}$ produced larger mean coverage errors than HT (e.g., $\approx 6 \times$ higher for 17214), while this occurred for only 4/39 of the estimates for PPE/PPD, and with much smaller magnitudes. The full set of metrics associated with each NAPCS-5 code estimate for both HT and PPD can be found in Table 3. Due to space limitations, we omit the PPE results, as they are essentially equivalent to PPD up to the decimal places reported in the table. For example, for MSE, coverage error, CV, and bias, the mean absolute difference between PPE and PPD first appears at the fifth decimal place; for MAE and CI width, at the fourth decimal place; and for coverage, at the third decimal place. One interesting finding, consistent with previous experiments, is that even when the HT estimator resulted in a higher mean coverage, PPD almost always resulted in a lower mean coverage error. This suggests that when the PPD CI does not contain the ground truth population parameter, its endpoints tend to be much closer to that parameter than those of the CIs produced by the HT estimator.

The ${GREG}_{\hat{y}}$ estimator demonstrated similar performance to both PPD and PPE, but it was less consistent, especially with respect to coverage and coverage error. Though the ${GREG}_{\hat{y}}$ estimator resulted in lower CV for these estimates, it did so at the cost of under-coverage. Focusing on Table 3, at a sample size of $n = 1000$, we see that PPD (and PPE) consistently outperform the HT estimator across the metrics computed. For instance, averaged across all NAPCS-5 estimates, PPD achieves 93.7% mean coverage versus 93.4% for HT, while reducing coverage error by $\approx 55\%$. Despite its higher coverage rate, the PPD estimator was more efficient, with a $\approx 56\%$ reduction in CI width (2.7 vs. 6.1) relative to the HT estimator. The ability to use auxiliary data enabled PPD to produce more accurate estimates of the population parameters, with a $\approx 50\%$ reduction in MAE (0.6 vs. 1.2). Together, on average, this yielded a $\approx 34\%$ reduction in CV (8.3 vs. 12.6).
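The percentage reductions quoted above follow from the Mean row of Table 3, as the short calculation below shows. The helper name is ours, and because the inputs are the rounded table entries, the results agree with the quoted figures only approximately (the quoted values were presumably computed from unrounded metrics).

```python
def pct_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


# (HT, PPD) pairs taken from the Mean row of Table 3.
mean_row = {
    "coverage error": (3.6, 1.6),
    "MAE": (1.2, 0.6),
    "CI width": (6.1, 2.7),
    "CV": (12.6, 8.3),
}

reductions = {
    metric: round(pct_reduction(ht, ppd), 1)
    for metric, (ht, ppd) in mean_row.items()
}
```

With the rounded inputs this gives reductions of roughly 56% (coverage error), 50% (MAE), 56% (CI width), and 34% (CV).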

5. Real-World Dataset

In this section, we apply PPD to real-world data collected by Statistics Canada, Canada’s national statistical organization. The Quarterly Survey of Financial Statements (QSFS) collects data used to measure the financial position and performance of incorporated businesses within Canada (Government of Canada 2025b). The data include items related to assets, liabilities, and equity within a quarterly balance sheet, as well as revenue and expense data reported on a quarterly income statement. The survey is collected quarterly, with a population frame size of approximately 28,000 enterprises. Sampling is carried out using a stratified random sample, with strata constructed based on the size of the unit (assets). Sample sizes are typically around 5,000 units.

The auxiliary data considered include monthly tax (GST) data and General Index of Financial Information (GIFI) data (Government of Canada 2025a), as well as the previous year’s annual revenue and annual assets from Statistics Canada’s internal Business Register. In addition, relatively static administrative data are available for each unit, with information related to the size of the enterprise, the industry under which the enterprise is classified, and the North American Industry Classification System (NAICS) code associated with that enterprise. For each quarter, the GST data available cover the first and second month of that quarter. GIFI codes are associated with particular items found on a corporation’s financial statements (e.g., balance sheets, income statements, and statements of retained earnings). However, there is approximately a two-year lag in the availability of the GIFI data, relative to a given survey quarter. These covariates are used for model fitting and point-predictions at survey time.

5.1. Methodology

For these experiments, quarterly QSFS survey data was available for 11 quarters, from 2022-Q1 to 2024-Q3. For each target quarter, we fit a gradient boosting regression model (Friedman 2001; scikit-learn 2025a) using the full survey data from the immediately preceding quarter ($n \approx 5{,}000$ observations, reflecting the operational QSFS sample size). The experiments iterate chronologically over each quarter and use only data that would have been available at the time of each quarter. We apply a consistent model-fitting procedure across all target variables, with no variable-specific adjustments. For each quarter and each variable, data from the preceding quarter is split 70/30 into training and validation sets; hyperparameter tuning is performed for 100 steps using Optuna (Optuna 2024) over standard gradient boosting hyperparameters. The best-performing model is saved and used in downstream simulations.
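The chronological design can be outlined as below: each target quarter is paired with the quarter immediately before it, and the earlier quarter's records are split 70/30. The helper names and the integer stand-in records are hypothetical, the random shuffle before splitting is an assumption, and the model fit itself (gradient boosting tuned with Optuna) is deliberately left out.

```python
import random


def quarters(start_year, start_q, end_year, end_q):
    """List quarter labels such as '2022-Q1' from start to end, inclusive."""
    labels = []
    year, q = start_year, start_q
    while (year, q) <= (end_year, end_q):
        labels.append(f"{year}-Q{q}")
        year, q = (year, q + 1) if q < 4 else (year + 1, 1)
    return labels


def split_70_30(records, rng):
    """Shuffle a copy of the records and split into 70% train, 30% validation."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


available = quarters(2022, 1, 2024, 3)
# Each pair is (quarter used to fit the model, target quarter to estimate).
fit_target_pairs = list(zip(available[:-1], available[1:]))

# Stand-in for one quarter of roughly 5,000 survey records.
train, valid = split_70_30(range(5000), random.Random(0))
```

The 11 available quarters yield ten (fit, target) pairs, which is why the simulations cover 2022-Q2 to 2024-Q3.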

We ran simulations over the 2022-Q2 to 2024-Q3 quarters (ten quarters total). For each quarter, the full survey data was treated as a population frame and SRSWOR was performed for sample sizes $n \in \{25, 50, 100, 200, 500, 1000\}$, repeated for 1,000 Monte Carlo replicates. The following population parameters were estimated: total revenue, total operating revenue, wages and salaries, and sales of goods and services. We compare the HT estimator (baseline) against the PPD estimator. A standard GREG estimator was also evaluated but did not yield reliable results with the available auxiliary data under a linear model assumption, illustrating a setting where auxiliary information is informative non-linearly but linear model-assisted approaches are not viable. Note that whether PPD outperforms unbiased or consistent model-training alternatives under this setup is not directly assessed here; the primary purpose of this experiment is to demonstrate the practical utility of PPD on real survey data across a range of sample sizes and survey quarters.
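To make the HT versus PPD comparison concrete, the sketch below runs a small Monte Carlo study with the textbook difference estimator under SRSWOR, $\sum_{k \in U} f(x_k) + \frac{N}{n} \sum_{k \in s} (y_k - f(x_k))$, which is the classical form that PPD re-frames for pre-trained models. The synthetic population, the quadratic "pre-trained" predictor, and all function names are assumptions for illustration; none of this uses QSFS data or the authors' code.

```python
import random
import statistics


def ht_total(y_s, N):
    """HT (expansion) estimator of the total under SRSWOR."""
    return N / len(y_s) * sum(y_s)


def ppd_total(y_s, f_s, f_pop_total, N):
    """Difference estimator: population prediction total plus expanded residuals."""
    residuals = [y - f for y, f in zip(y_s, f_s)]
    return f_pop_total + N / len(y_s) * sum(residuals)


def srswor_var_total(values, N):
    """Textbook SRSWOR variance estimator for an expansion total of `values`.

    Pass sampled y for HT, or sampled residuals y - f for the difference estimator.
    """
    n = len(values)
    return N * N * (1 - n / N) * statistics.variance(values) / n


rng = random.Random(1)
N, n, reps = 2000, 50, 300
x = [rng.uniform(0, 10) for _ in range(N)]
y = [xi ** 2 + rng.gauss(0, 5) for xi in x]   # non-linear in x
f = [xi ** 2 for xi in x]                     # predictions fixed before sampling
true_total, f_total = sum(y), sum(f)

ht_err, ppd_err = [], []
for _ in range(reps):
    s = rng.sample(range(N), n)               # SRSWOR draw
    y_s, f_s = [y[k] for k in s], [f[k] for k in s]
    ht_err.append(abs(ht_total(y_s, N) - true_total))
    ppd_err.append(abs(ppd_total(y_s, f_s, f_total, N) - true_total))

mae_ht, mae_ppd = statistics.mean(ht_err), statistics.mean(ppd_err)
```

In this toy setting the residuals vary far less than $y$ itself, so the difference estimator has a much smaller MAE; with a poor predictor the residual variance, and hence the gain, would shrink or vanish.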

5.2. Results

High-level aggregate results across all sample sizes, survey quarters, and experimental repeats are shown using violin plots in Figure 3. Relative $(\frac{PPD}{HT})$ MAE, MSE, coverage, coverage error, and CI width are shown for the four target survey variables. Across all experiments, it can be seen that PPD generally reduces MAE ( $\approx 2$ – $2.5 \times$ , on average), reduces MSE ( $\approx 3$ – $5 \times$ , on average), reduces coverage error ( $\approx 2$ – $2.5 \times$ , on average), and reduces CI width ( $\approx 1.5$ – $2.25 \times$ , on average), while maintaining similar coverage rates to the HT estimator.

Figure 3.

Aggregated relative performance of PPD versus HT estimators across various metrics over ten QSFS survey quarters for four survey response variables (see legend).

We explore the HT-equivalent sample sizes for the QSFS variables, defined as the smallest sample size that the PPD estimator requires to achieve the same estimation accuracy as the HT estimator at $n = 1000$ (HT@ $n$ = 1,000). For each survey variable and for each survey quarter we find the smallest sample size in $\{25, 50, 100, 200, 500, 1000\}$ at which the PPD estimator achieves higher accuracy (lower MAE) and higher efficiency (smaller CI width).
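The search for the HT-equivalent sample size can be written as a scan over the grid of sample sizes, as sketched below. The function name, the data layout, and the metric values are hypothetical, and returning `None` when no grid size qualifies is an assumption about how that case is handled.

```python
def ht_equivalent_n(metrics_by_n, ht_reference):
    """Smallest n at which the estimator beats the HT reference on both metrics.

    metrics_by_n: {n: (mae, ci_width)} for the model-assisted estimator.
    ht_reference: (mae, ci_width) of the HT estimator at the reference size.
    """
    ht_mae, ht_width = ht_reference
    for n in sorted(metrics_by_n):
        mae, width = metrics_by_n[n]
        if mae < ht_mae and width < ht_width:
            return n
    return None  # no sample size on the grid qualifies


# Hypothetical Monte Carlo summaries for one variable and one quarter.
ppd_metrics = {
    25: (9.0, 40.0), 50: (6.5, 29.0), 100: (4.6, 21.0),
    200: (3.2, 14.5), 500: (2.0, 9.2), 1000: (1.4, 6.5),
}
n_equiv = ht_equivalent_n(ppd_metrics, ht_reference=(3.0, 13.0))
```

In this made-up example the first grid size that beats the reference on both MAE and CI width is 500, so the estimator would need half of the HT reference sample.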

Figure 4 shows the HT-equivalent sample size (top left) for each survey variable across the QSFS survey quarters ($x$-axes), as well as several relative metrics of the PPD estimator evaluated at the HT-equivalent sample size, $\left(\frac{\text{PPD}@n = \text{effective-}n}{\text{HT}@n = 1000}\right)$: relative MAE (top right), relative MSE (bottom left), and relative CI width (bottom right). For each of the survey response variables, PPD achieved an HT-equivalent sample size smaller than $n = 1000$ for most of the survey quarters. For each variable we report the min/max/median HT-equivalent sample size over the ten survey quarters: Total revenue (200/500/500); Total operating revenue (200/1,000/500); Sales of goods and services (100/500/500); Wages and salaries (200/1,000/500). The median HT-equivalent sample size for all variables was 500, a $50\%$ reduction in sample size relative to HT. Over the course of the ten survey quarters, 26/30 of the estimates resulted in PPD@ $n < 1000$, requiring at most half the sample size needed by the HT estimator to achieve the same quality of estimation. From 2023-Q3 to 2024-Q3, for each of the survey variables considered, the HT-equivalent sample size of PPD was at most $n = 500$, and as low as $n = 100$. For example, the PPD estimate of sales of goods and services for 2024-Q3 with $n = 100$ samples resulted in a $\approx 10\%$ reduction in MSE and CI width relative to the HT estimate, which used $10 \times$ the sample size.

Figure 4.

HT-equivalent sample size for QSFS variables (PPD vs. HT@ $n$ = 1,000) over survey quarters. Top left shows the HT-equivalent sample size of PPD for four target survey variables evaluated over ten survey quarters, defined as the smallest sample size that PPD requires to achieve the same estimation accuracy as HT@ $n$ = 1,000. Top right, bottom left, and bottom right show the relative MAE, relative MSE, and relative CI width, respectively, of PPD at the HT-equivalent sample size versus HT@ $n$ = 1,000 across survey quarters. A horizontal line at 1.0 signifies parity with HT.

For each survey response variable we report summary statistics on the relative metrics (Table 4) across the survey quarters. Here, the relative metrics are reported with respect to $\frac{\text{PPD}@\text{effective-}n}{\text{HT}@n = 1000}$, representing a counterfactual scenario of what the relative performance of PPD would have been had the survey sample size been the HT-equivalent sample size. Despite the median HT-equivalent sample size being $n = 500$, Table 4 shows that the median and mean relative performance of PPD across all survey variables results in an improvement over the HT estimator with $n = 1000$. For example, the median relative MSE ranged between 54.0% and 62.1%, equivalent to a 37.9% to 46.0% median reduction in MSE over HT, despite PPD utilizing smaller sample sizes. Similarly, the median relative CI width ranged between 69.9% and 78.1%, equivalent to a 21.9% to 30.1% median reduction in CI width over HT, despite PPD using smaller sample sizes.

Table 4.

Relative Performance of $\frac{\text{PPD}@\text{effective-}n}{\text{HT}@n = 1000}$. Summary Statistics of PPD Estimator Evaluated Across Ten QSFS Survey Quarters for Four Survey Response Variables.

Metric (%)	Survey variable	Min	Max	Median	Mean
Rel MAE	Total revenue	60.1	93.3	73.2	74.7
	Total operating revenue	60.9	97.7	77.6	79.3
	Sales of goods and services	65.8	94.0	76.0	76.8
	Wages and salaries	66.5	98.3	78.9	81.3
Rel MSE	Total revenue	37.8	89.4	54.0	57.2
	Total operating revenue	36.9	96.7	60.4	64.0
	Sales of goods and services	44.4	91.7	58.5	60.7
	Wages and salaries	44.6	95.4	62.1	68.0
Rel CI width	Total revenue	59.1	87.8	69.9	71.8
	Total operating revenue	59.0	92.4	72.3	75.1
	Sales of goods and services	61.7	87.8	73.4	72.6
	Wages and salaries	65.6	93.1	78.1	79.1

6. Discussion

This work examined the use of pre-trained black-box models for survey estimation. Despite the difference estimator and ${GREG}_{\hat{y}}$ having been available for decades, their use with pre-trained models has not been widely adopted in practice. This work also developed a mathematical framework to extend the Prediction-Powered Inference (PPI) framework (Angelopoulos et al. 2023a) to the survey design setting, via the Prediction-Powered Estimator (PPE).

We considered three estimators that make use of pre-trained models: PPE, as a direct extension of PPI to survey settings; the Prediction-Powered Difference (PPD) estimator, as a re-framing of the generalized difference estimator (Cassel et al. 1976) when a pre-trained model is used; and ${GREG}_{\hat{y}}$ , which uses point-estimates from a pre-trained model as covariates in a standard GREG estimator. All three differ from classical model-assisted approaches in that the model $f$ is fit on historical or external data rather than on the survey sample itself. This is the key practical advantage when survey samples are small or when the auxiliary data is unstructured, for which in-sample model fitting is not feasible.

Through LLM-based experiments (Section 4) and real-world QSFS experiments (Section 5), the proposed approaches consistently outperformed HT and, where applicable, GREG and MC baselines across a range of sample sizes. The simulation experiments (Supplemental Material) corroborate these findings across a range of controlled data-generating processes. HT-equivalent sample size analyses suggest that these approaches could contribute to partial reductions in survey sample sizes and respondent burden, though the magnitude of gains will depend on the quality of the pre-trained model and the degree of distributional shift between training and survey data.

The simulation results (Supplemental Material) highlight a fundamental limitation of MC estimation with non-linear models: because the variance is estimated from the residuals of the same data used to fit the model, overfitting produces misleadingly small residuals and severely underestimated variance, resulting in poor coverage—on average just above 10% in the simulations. By contrast, PPE and PPD use residuals evaluated on the survey sample, which is independent of model training, yielding unbiased variance estimates. This distinction is practically important as survey sample sizes decrease and non-response increases, making asymptotic justifications for MC and GREG less reliable.
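The residual argument can be illustrated with a deliberately extreme toy case: a 1-nearest-neighbour predictor that memorises its training data. This is not the MC estimator and not the simulation from the Supplemental Material; the data-generating process, sample sizes, and function names are assumptions chosen only to show why in-sample residuals can understate error while residuals from an independently fitted model do not.

```python
import random
import statistics


def one_nn(train_x, train_y):
    """Return a 1-nearest-neighbour predictor that memorises its training data."""
    def predict(x):
        j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[j]
    return predict


rng = random.Random(2)


def draw(m):
    """Draw m (x, y) pairs from a noisy linear relationship."""
    xs = [rng.uniform(0, 10) for _ in range(m)]
    ys = [2 * xv + rng.gauss(0, 3) for xv in xs]
    return xs, ys


ext_x, ext_y = draw(200)   # external or historical data, independent of the sample
s_x, s_y = draw(50)        # the survey sample

in_sample_model = one_nn(s_x, s_y)        # fitted on the sample itself
pre_trained_model = one_nn(ext_x, ext_y)  # fitted before and apart from the sample

resid_in = [yv - in_sample_model(xv) for xv, yv in zip(s_x, s_y)]
resid_pre = [yv - pre_trained_model(xv) for xv, yv in zip(s_x, s_y)]

var_in = statistics.variance(resid_in)    # residuals of the memorised sample
var_pre = statistics.variance(resid_pre)  # residuals on data the model never saw
```

The in-sample residuals are all zero because each sampled point is its own nearest neighbour, so any variance estimate built from them collapses to zero, whereas the residuals of the independently fitted model retain the real prediction error.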

The results also illustrate that when auxiliary data has a non-linear relationship with the target variable—which is the case for most unstructured data sources—linear model-assisted approaches are insufficient, while pre-trained non-linear models used via PPD, PPE, or ${GREG}_{\hat{y}}$ can be highly effective. When auxiliary data has a linear relationship with the target, standard GREG remains a competitive and well-justified choice.

The performance of the proposed estimators depends directly on the quality of the pre-trained model. In the experiments presented here, no domain-specific model development was undertaken beyond generic hyperparameter tuning (Optuna 2024), and no prompting strategies were optimized for the LLM-based tasks. Results should therefore be interpreted as indicative rather than as an upper bound on achievable performance; targeted model development in a production setting may yield different results depending on the application.

Whether PPD and PPE outperform unbiased or consistent model-training alternatives—such as the SRB estimator of Sanguiao-Sande and Zhang (2021)—in specific survey settings remains an open question and a direction for future work. Additional promising directions include the application of PPE and PPD under more complex survey designs, and the use of pre-trained model predictions to inform strata allocation in stratified random sampling (Godfrey et al. 1984).

These results do not argue that model-assisted techniques which fit models using survey data should be abandoned. Rather, the survey methodologist should consider both in-sample and pre-trained model approaches for their particular use-case, selecting the one most appropriate to the setting, the available auxiliary data, and the model classes that are feasible given the sample size.

Calibration-based refinements of PPD (a Prediction-Powered Model-Calibration Estimator), such as using pre-trained predictions as calibration variables to enforce agreement between sample-weighted and population prediction totals, in the spirit of the model-calibration framework of Wu and Sitter (2001), represent a natural extension that may offer additional robustness to systematic prediction bias and merit further investigation.

Supplemental Material

sj-docx-1-jof-10.1177_0282423X261451320 – Supplemental material for Prediction-Powered Estimation: Unbiased Model-Assisted Estimation

Supplemental material, sj-docx-1-jof-10.1177_0282423X261451320 for Prediction-Powered Estimation: Unbiased Model-Assisted Estimation by Nicholas Denis and Mohammed Haddou in Journal of Official Statistics

Footnotes

Acknowledgements

The authors would like to acknowledge: Jean-Francois Beaumont, Keven Bosa, and Steve Matthews for reviewing and providing helpful feedback on manuscript drafts, and Ivy McKee for help processing QSFS data and providing helpful background information on the QSFS survey.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Nicholas Denis

Supplemental Material

Supplemental material for this article is available online.

Received: May 23, 2025

Accepted: April 15, 2026

References

Angelopoulos A. N., Bates, Fannjiang, Jordan M. I., and Zrnic. 2023a. “Prediction-Powered Inference.” Science 382 (6671): 669–74. DOI: https://doi.org/10.1126/science.adi6000.

Angelopoulos A. N., Bates, Fannjiang, Jordan M. I., and Zrnic. 2023b. “Prediction-Powered Inference.” https://arxiv.org/abs/2301.09633.

Angelopoulos A. N., Duchi J. C., and Zrnic. 2024. “PPI++: Efficient Prediction-Powered Inference.” https://arxiv.org/abs/2311.01453.

Breidt F. J., and Opsomer J. D. 2017. “Model-Assisted Survey Estimation with Modern Prediction Techniques.” Statistical Science 32 (2): 190–205. DOI: https://doi.org/10.1214/16-STS589.

Cassel C. M., Särndal C. E., and Wretman J. H. 1976. “Some Results on Generalized Difference Estimation and Generalized Regression Estimation for Finite Populations.” Biometrika 63 (3): 615–20. DOI: https://doi.org/10.1093/biomet/63.3.615.

Dagdoug, Goga, and Haziza. 2023. “Model-Assisted Estimation Through Random Forests in Finite Population Sampling.” Journal of the American Statistical Association 118 (542): 1234–51. DOI: https://doi.org/10.1080/01621459.2021.1987250.

Friedman J. H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (5): 1189–232. http://www.jstor.org/stable/2699986.

Godfrey, Roshwalb, and Wright R. L. 1984. “Model-Based Stratification in Inventory Cost Estimation.” Journal of Business & Economic Statistics 2 (1): 1–9. DOI: https://doi.org/10.1080/07350015.1984.10509365.

Google. 2025. “Google AI for Developers.” https://ai.google.dev/gemini-api/docs/.

Google DeepMind. 2025. “Gemini.” https://deepmind.google/technologies/gemini/.

Government of Canada. 2025a. “General Index of Financial Information.” https://www.canada.ca/en/revenue-agency/services/forms-publications/publications/rc4088/general-index-financial-information-gifi.html.

Government of Canada. 2025b. “Quarterly Survey of Financial Statements.” https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=2501.

Government of Canada. 2025c. “T4 Statement of Remuneration Paid.” https://www.canada.ca/en/revenue-agency/services/forms-publications/forms/t4.html.

Hofer R. A., Maynez, Dhingra, Fisch, Globerson, and Cohen W. W. 2024. “Bayesian Prediction-Powered Inference.” https://arxiv.org/abs/2405.06034.

Kaggle. 2025. “Walmart Grocery Product Dataset.” https://www.kaggle.com/datasets/polartech/walmart-grocery-product-dataset.

Niculescu-Mizil, and Caruana. 2005. “Predicting Good Probabilities with Supervised Learning.” Proceedings of the 22nd International Conference on Machine Learning, ICML ’05. Association for Computing Machinery. DOI: https://doi.org/10.1145/1102351.1102430.

OpenAI. 2024. “GPT-4o System Card.” https://arxiv.org/abs/2410.21276.

Optuna. 2024. “Optuna.” https://optuna.readthedocs.io/en/stable/installation.html.

Pydantic. 2025. “Pydantic Python Documentation.” https://docs.pydantic.dev/latest/.

Sanguiao-Sande, and Zhang L. C. 2021. “Design-Unbiased Statistical Learning in Survey Sampling.” Sankhyā A 83 (2): 714–44. https://www.jstor.org/stable/48767321.

Särndal C. E., Swensson, and Wretman. 1992. Model Assisted Survey Sampling. 1st ed. Springer.

scikit-learn. 2025a. “Gradient Boosting Regressor.” https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html.

scikit-learn. 2025b. “Isotonic Regression.” https://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html.

Wu, and Sitter R. R. 2001. “A Model-Calibration Approach to Using Complete Auxiliary Information from Survey Data.” Journal of the American Statistical Association 96 (453): 185–93. DOI: https://doi.org/10.1198/016214501750333054.

Zrnic, and Candès E. J. 2024a. “Active Statistical Inference.” https://arxiv.org/abs/2403.03208.

Zrnic, and Candès E. J. 2024b. “Cross-Prediction-Powered Inference.” Proceedings of the National Academy of Sciences of the United States of America 121 (15): e2322083121. DOI: https://doi.org/10.1073/pnas.2322083121.
