Domain estimation from weighted nonprobability samples

Abstract

When inferring population characteristics from a nonprobability sample, it is crucial to correct for possible selection bias, for example by pseudo-weighting. Many correction methods focus on estimating the population mean of the target variable. However, quantities for subpopulations (domains) are often of interest as well. It is unclear whether pseudo-weights are suitable for domain estimation, since the weights unavoidably introduce variation and possibly even bias in the downstream estimates.
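To make the role of pseudo-weights in domain estimation concrete, the following is a minimal sketch of a weighted (Hájek-type) domain mean, in which each unit's pseudo-weight enters both the numerator and the denominator of its domain's estimate. The function name, the record layout, and the toy data are assumptions made for illustration and are not taken from the article.

```python
from collections import defaultdict


def weighted_domain_means(records):
    """Return the pseudo-weighted mean of y for each domain.

    records: iterable of (domain, y, weight) tuples, where weight is the
    unit's pseudo-weight (hypothetical layout, for illustration only).
    """
    weighted_sum = defaultdict(float)   # sum of w * y per domain
    weight_total = defaultdict(float)   # sum of w per domain
    for domain, y, w in records:
        weighted_sum[domain] += w * y
        weight_total[domain] += w
    # Ratio of weighted total to weight total within each domain.
    return {d: weighted_sum[d] / weight_total[d] for d in weighted_sum}


# Toy nonprobability sample: (domain, binary target, pseudo-weight).
sample = [
    ("north", 1.0, 2.0),
    ("north", 0.0, 6.0),
    ("south", 1.0, 3.0),
    ("south", 1.0, 1.0),
    ("south", 0.0, 4.0),
]
domain_means = weighted_domain_means(sample)
```

Because each domain estimate rests on only a few weighted units, a single large weight can dominate it, which is one way the weights can add variation to domain estimates.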

To address this issue, modeling on the domain level may be an option. We evaluate two promising domain estimation methods on weighted nonprobability samples. The first is iterative proportional fitting (IPF), which takes the margins into account in the domain estimation, so that the marginal values can be held fixed while the domain estimates are improved. The other is a hierarchical Bayesian model, in which the pseudo-weights are included in the domain modeling process. This approach offers modeling flexibility when different types of information are available. We evaluate a range of modeling options for the two methods and compare them in a simulation study. We also evaluate the methods on resampled real data sets to mimic the scenario in which the relation between variables and the inclusion mechanism of the nonprobability sample are unknown to the researchers.
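As an illustration of the IPF idea described above, the following is a minimal sketch of classical two-way iterative proportional fitting: a seed table is alternately rescaled by rows and by columns until its margins match given targets. The function name `ipf`, the choice of rows as domains and columns as outcome categories, and all numbers are assumptions for illustration; the sketch does not reproduce the specific tables or modeling options evaluated in the article.

```python
def ipf(table, row_targets, col_targets, max_iter=1000, tol=1e-10):
    """Rescale a two-way table so its row and column sums match the targets.

    table: list of rows with positive cell values (the seed table).
    row_targets, col_targets: desired margins; both must sum to the same total.
    """
    fitted = [row[:] for row in table]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row to its target sum.
        for i, target in enumerate(row_targets):
            row_sum = sum(fitted[i])
            if row_sum > 0:
                factor = target / row_sum
                fitted[i] = [cell * factor for cell in fitted[i]]
        # Column step: scale each column to its target sum.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in fitted)
            if col_sum > 0:
                factor = target / col_sum
                for row in fitted:
                    row[j] *= factor
        # Columns are exact after the column step; stop once rows agree too.
        max_gap = max(abs(sum(fitted[i]) - t) for i, t in enumerate(row_targets))
        if max_gap < tol:
            break
    return fitted


# Hypothetical seed: rows are two domains, columns are two outcome categories.
seed = [[30.0, 10.0], [20.0, 40.0]]
row_targets = [500.0, 500.0]  # assumed known domain sizes
col_targets = [450.0, 550.0]  # assumed overall totals per outcome category
fitted = ipf(seed, row_targets, col_targets)
```

Since every step only multiplies whole rows or whole columns by a constant, the fitted table keeps the association structure (the cross-product ratio) of the seed while taking its margins from the targets.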

We find that both applying IPF to the unweighted table and using the hierarchical Bayesian model improve domain estimation in most cases. If both marginal and domain estimates are of interest, the estimated overall population total or mean should be taken into account in the domain modeling process.

Keywords

Nonprobability sample; domain estimation; pseudo-weighting; iterative proportional fitting; hierarchical Bayesian model

