1. Introduction
When JOS was founded, inference for finite population parameters was mostly based on probability sample surveys and followed a design-based approach. Even with the advent of the model-assisted framework (Särndal et al. 1992), which opened the way to the use of (assisting) models in survey estimation, there remained a widespread perception that nothing beyond a linear regression model was truly necessary in this context. This view was particularly prevalent in National Statistical Institutes (NSIs), where, in many countries, the production process of official statistics relied primarily on data from sample surveys and weighting systems enhanced through calibration (Deville and Särndal 1992). Although the link between calibration and regression estimation was well recognized, the implicit assumption of a linear assisting model underlying the calibration adjustment was not generally considered an issue. Indeed, calibration estimation offered: (i) coherence with administrative data, population-level auxiliary information, and/or other published statistics; (ii) design-consistent estimates with improved efficiency relative to basic Horvitz-Thompson estimation; and (iii) some protection against nonresponse bias (Lundström and Särndal 1999; Särndal 2007). Moreover, the final set of weights requires auxiliary information only for the sample units, with population aggregates imported from external sources, and can be applied to any characteristic of interest, as it does not depend on the survey variables themselves.
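To make the calibration adjustment described above concrete, the sketch below solves the classical chi-square-distance calibration problem in closed form (the linear/GREG case underlying Deville and Särndal 1992): design weights are adjusted so that the weighted sample totals of the auxiliary variables match known population totals. The function name and toy data are illustrative only, not taken from any NSI production system.

```python
import numpy as np

def linear_calibration(d, X, totals):
    """Linear (GREG-type) calibration with chi-square distance.

    d      : design weights for the sample units (n,)
    X      : auxiliary variables for the sample units (n, p)
    totals : known population totals of the auxiliary variables (p,)

    Minimizing sum((w_i - d_i)^2 / d_i) subject to X'w = totals gives the
    closed form w = d * (1 + X @ lam), lam = (X'DX)^{-1} (totals - X'd).
    """
    lam = np.linalg.solve(X.T @ (d[:, None] * X), totals - X.T @ d)
    return d * (1.0 + X @ lam)

# Illustrative use: 10 sampled units, intercept plus one auxiliary variable.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.random(10)])
d = np.full(10, 5.0)                      # equal design weights
totals = np.array([50.0, 26.0])           # hypothetical population benchmarks
w = linear_calibration(d, X, totals)      # X.T @ w now reproduces `totals`
```

Because the calibrated weights do not depend on the survey variables, the same `w` can be applied to any characteristic of interest, as noted in the text.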
Nonetheless, such a low-risk framework, in which the number of choices to be made is relatively limited, is challenged by declining response rates, the need to reduce administration costs and response burden, and the demand for more timely statistical information at a higher frequency and a more detailed level. Therefore, over the past decade, a paradigm shift has been advocated for NSIs, moving toward a framework based on the integration of multiple data sources, such as probability and nonprobability surveys, administrative data, and big and other new data sources. This modernization process requires a more intensive use of modeling tools and implies an increased level of risk arising from an increased number of (modeling) choices. In this setting, interest in machine learning (ML) methods has grown rapidly, and international initiatives on ML for official statistics have been launched, such as the UNECE High-Level Group on Modernisation of Official Statistics ML Project and, subsequently, the United Kingdom’s Office for National Statistics—UNECE ML Group (all project materials, such as reports, code, data, presentations, and papers, are available on the UNECE Wiki: UNECE 2023). Discussions on the use of ML in official statistics have also focused on data processing and on classification and coding of textual data (e.g., Measure 2023). The focus here is on its role and potential for the estimation of population parameters, including for subpopulations, and in the presence of new data sources which may introduce selection bias.
2. Estimation of Finite Population Parameters
To estimate population parameters, prior research extending calibration and regression estimation to non-linear models is particularly useful. The contribution of Breidt and Opsomer (2000), which introduces scatterplot smoothing via local polynomials in regression estimation, may be regarded as seminal in this regard. The model calibration framework proposed by Wu and Sitter (2001), which replaces benchmark constraints on the population totals of auxiliary variables with those on the totals of predicted values derived from parametric nonlinear or generalized linear models, also enabled developments toward greater model flexibility in calibration estimation.
A general framework for deriving model-assisted estimators using data from complex surveys in conjunction with auxiliary information is presented by Breidt and Opsomer (2017). This framework accommodates a wide range of predictive methods, including linear and nonlinear parametric models as well as nonparametric and ML techniques. It supports both regression and model calibration estimation strategies. Examples of nonparametric regression and statistical learning models already implemented within this framework include generalized additive mixed models using penalized splines, neural networks, projection pursuit, k-nearest neighbors, shrinkage methods such as the (adaptive) lasso, regression trees, and ensemble methods such as model averaging, random forests, and bagging (Breidt and Opsomer 2017; Dagdoug et al. 2023; Goga 2024). These approaches, although still design-based, are typically highly variable-specific and, in order to be computed, often require complete auxiliary information, that is, auxiliary variables known for every unit in the population. Consequently, their use in large-scale, multipurpose surveys has remained limited, with notable exceptions including environmental and forest inventory surveys (e.g., Baffetta et al. 2009; Opsomer et al. 2007). However, following the modernization call, many NSIs, particularly in Europe, are working to create population register systems by integrating data from diverse sources, and this has the potential to provide rich frames with unit-level auxiliary information to be used effectively with ML methods. In the area of estimation of population parameters, further research is needed on variance estimation for the final estimators, as the analytic approach currently available, based on sample residuals, can underestimate the true variance if the ML model is not trained properly and overfits the data.
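As a minimal illustration of this model-assisted framework, the sketch below computes the generalized difference estimator of a population total, which sums the ML predictions over the population and adds the design-weighted sample residuals. A k-nearest-neighbour predictor, one of the methods listed above, is implemented with NumPy only so the example stays self-contained; all names and data are illustrative.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    """k-nearest-neighbour regression for a one-dimensional covariate."""
    dist = np.abs(x_new[:, None] - x_train[None, :])   # pairwise distances
    idx = np.argsort(dist, axis=1)[:, :k]              # k closest training units
    return y_train[idx].mean(axis=1)

def model_assisted_total(y_s, m_s, pi_s, m_U):
    """Generalized difference (model-assisted) estimator of a total:
    population sum of predictions m_U plus the design-weighted sum of
    sample residuals (y_s - m_s) / pi_s. Design-consistent regardless of
    whether the working ML model is correctly specified."""
    return m_U.sum() + ((y_s - m_s) / pi_s).sum()

# Illustrative use: with the null predictor m == 0, the estimator reduces
# to the basic Horvitz-Thompson estimator, as it should.
y_s = np.array([2.0, 4.0])
pi_s = np.array([0.5, 0.5])
ht_equivalent = model_assisted_total(y_s, np.zeros(2), pi_s, np.zeros(10))
```

Note that computing `m_U` requires the auxiliary variable for every population unit, which is precisely the complete-auxiliary-information requirement discussed above.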
Small area estimation can improve the level of detail, frequency, and timeliness of official statistics without increasing sample sizes and thus data collection costs, and ML methods are natural candidates for this essentially model-based approach. Tzavidis (2025) provides a discussion of the opportunities and challenges of ML for small area estimation in official statistics, while Molina (2024) offers an overview of ML methods for small area estimation with a particular focus on wealth indicators. Indeed, there is a large literature on ML algorithms for predicting poverty using satellite or aerial images, particularly in countries with limited availability of administrative data and/or large-scale sample surveys. In these applications, ML algorithms are applied to establish the relation between survey data and mobile phone data or sensor data, such as night-time light intensity from satellite images; predictions of poverty at finer regional levels are then obtained by applying the algorithm to the sensor data, which are assumed to cover the population of interest. See van den Brakel (2022, 2025) and Hall et al. (2023) for recent reviews. All these methods are synthetic, in the sense that they do not include area effects and therefore assume that the available covariates, paired with flexible ML methods, explain all the between-area heterogeneity. The only exception is Krennmair and Schmid (2022), where mixed effects random forests are used to combine the advantages of regression forests with the ability to model hierarchical/clustered data. An alternative approach that works within an ML perspective is proposed in Parker (2024), where random weight neural networks are used to extend the Fay-Herriot model. In a sense, research has moved along two different paths that could benefit from greater cross-fertilization.
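The synthetic character of these estimators can be made explicit with a short sketch: once unit-level ML predictions are available for the whole population, the synthetic estimate for each area is simply the within-area average of the predictions, with no area effect term. Function and variable names are illustrative.

```python
import numpy as np

def synthetic_area_means(m_hat_U, area_U):
    """Synthetic small area estimates: average the unit-level ML
    predictions m_hat_U within each area code in area_U. No random area
    effects are included, so the covariates (through the ML model) are
    implicitly assumed to explain all between-area heterogeneity."""
    return {a: m_hat_U[area_U == a].mean() for a in np.unique(area_U)}

# Illustrative use with four population units in two areas.
m_hat_U = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical ML predictions
area_U = np.array(["a", "a", "b", "b"])    # area membership of each unit
estimates = synthetic_area_means(m_hat_U, area_U)
```

Approaches such as mixed effects random forests (Krennmair and Schmid 2022) relax exactly this assumption by adding an area-level random effect to the prediction.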
As with general estimation, further research is needed in small area estimation to obtain reliable estimates of the MSE of final estimates obtained using ML models.
3. New Sources of Data: Accuracy and Selection Bias
The modernization process, with the use of register data and of new sources of data, requires extensive use of predictive models for imputing missing data, making it a natural field for ML methods that can handle high-dimensional, unstructured, and complex data sources. This renewed interest in mass imputation over weighting techniques raises several methodological and ethical issues, including concerns related to model transparency and the communication of prediction uncertainty. The latter, in particular, opens up new avenues for methodological research. In fact, when a target variable is predicted by an ML model for at least some units in the population, the final user should be made aware that these predicted values cannot be treated as error-free observations, tempting as this may be. In this setting, the prediction error of each predicted value should be evaluated and incorporated into the final statistics. Attempts in this direction for parametric models can be found in Alleva et al. (2021) and in Deliu et al. (2025), and could be extended to ML methods. Alternative approaches, more suited to ML, may look at replication methods.
An interesting direction to explore is conformal prediction, an assumption-lean approach to generating distribution-free prediction intervals or sets for nearly arbitrary predictive models, with guaranteed finite-sample coverage. Conformal methods are an active research topic in statistics and ML, but only recently have they been considered for non-exchangeable data and, therefore, brought to bear on design-based inference for a finite population (Wieczorek 2023). Deliu and Liseo (2025) examine the relevance and applicability of conformal prediction for register-based official statistics and provide a starting point for research in the field to be enhanced with ML models. Bersson and Hoff (2024) develop an area-level prediction region procedure for small area estimation using a conformal prediction approach, a seminal contribution that can be enhanced using ML models in future research.
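To illustrate the basic mechanism, the sketch below implements split conformal prediction under the standard exchangeability assumption: absolute residuals on a held-out calibration set yield a finite-sample (1 − α) prediction interval around any new prediction, whatever the underlying model. The design-based, non-exchangeable extensions cited above require modified (weighted) versions of this idea; names and data here are illustrative.

```python
import numpy as np

def split_conformal_interval(resid_cal, y_hat_new, alpha=0.1):
    """Split conformal prediction interval.

    resid_cal : residuals (y - y_hat) of any fitted model on a held-out
                calibration set, assumed exchangeable with new data
    y_hat_new : model predictions for the new units
    alpha     : miscoverage level (0.1 gives nominal 90% intervals)
    """
    n = len(resid_cal)
    # Finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(np.abs(resid_cal), q_level, method="higher")
    return y_hat_new - q, y_hat_new + q

# Illustrative use: 10 calibration residuals, one new prediction.
lo, hi = split_conformal_interval(np.arange(1.0, 11.0), np.array([5.0]))
```

The guarantee is marginal coverage of at least 1 − α over the randomness in both the calibration set and the new unit, which is what makes the approach attractive for communicating prediction uncertainty in register-based outputs.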
A relevant field of application of ML tools at NSIs has been web-scraped data. Barcaroli et al. (2015) use these data together with text mining to produce experimental statistics on the use of ICT and the Internet by Italian enterprises. Similarly, Daas and van der Doef (2020) use web-scraped data to predict the number of innovative companies in the Netherlands: sample data obtained from the Dutch Community Innovation Survey are used to annotate web-scraped data from business websites. This annotated data set is used to train an ML algorithm, which in turn is used to predict the total number of innovative businesses in the Netherlands.
The use of these new sources of data, including non-probability convenience samples, sensor or mobile phone data, and satellite images, may introduce the issue of selection bias. In particular, when the variables of interest are not measured on a probability sample, several approaches have been proposed in the literature to adjust for selection bias. Many are closely related to methods used to adjust for nonresponse in probability samples. These approaches consider the case in which a high-quality probability sample or a census can be used to adjust for the selection bias, because it measures a set of auxiliary variables in common with the biased source (see e.g., Valliant 2024; Wu 2022). In this setting, one can (i) estimate the probability of being in the biased source of data as a function of these auxiliary variables and, then, use an inverse probability weighting approach, or (ii) learn a model that links the study variables with the auxiliary variables on the biased source and then take a mass-imputation approach by predicting the variables of interest in the high-quality probability sample/census, or (iii) combine approaches (i) and (ii) in a doubly- or multiply-robust approach in the spirit of Chen and Haziza (2017). Including random area effects, a doubly robust approach has also been used to integrate probability and nonprobability data in order to adjust for selection bias and obtain small area estimates (see e.g., Schirripa Spagnolo et al. 2025). All these approaches can greatly benefit from the use of ML methods, as already shown in some studies (Castro-Martín et al. 2021, 2022; Ferri-García et al. 2021, 2024; Rueda et al. 2023).
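Approaches (i)-(iii) can be sketched in a few lines. The function below computes a doubly robust population mean in the spirit of Chen and Haziza (2017), combining inverse propensity weighting on the nonprobability sample B with mass imputation on the reference probability sample A; it assumes the propensities and outcome predictions have already been fitted (possibly with ML methods), and all names are illustrative.

```python
import numpy as np

def dr_mean(y_B, m_B, p_B, m_A, d_A):
    """Doubly robust estimator of a population mean.

    y_B : study variable observed on the nonprobability sample B
    m_B : outcome-model predictions for the units in B
    p_B : estimated propensities of inclusion in B for those units
    m_A : outcome-model predictions for the probability sample A
    d_A : design weights of the probability sample A

    Consistent if either the propensity model (approach i) or the outcome
    model (approach ii, mass imputation) is correctly specified.
    """
    N_hat = d_A.sum()                         # estimated population size
    ipw_term = ((y_B - m_B) / p_B).sum()      # propensity-weighted residuals
    imp_term = (d_A * m_A).sum()              # mass-imputation term on A
    return (ipw_term + imp_term) / N_hat

# Illustrative check: with a perfect outcome model on B (m_B == y_B),
# the estimator reduces to the weighted mean of the predictions on A.
mu_hat = dr_mean(np.array([1.0, 2.0]), np.array([1.0, 2.0]),
                 np.array([0.5, 0.5]), np.array([3.0, 4.0]),
                 np.array([2.0, 2.0]))
```

Replacing the parametric propensity and outcome models with ML learners is exactly where the studies cited above report efficiency gains.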
A different and promising perspective is taken in Lee et al. (2023), where a class of robust estimators is proposed by using multiple ML techniques as models for the outcome variable trained on a subsample of the respondents, computing their prediction errors on the hold-out subsample, and projecting these errors to the non-respondents under a cell mean response probability model. The final proposed estimator is a weighted average of these multiple estimators, with weights obtained via a subsampling Rao-Blackwell method, using a robust extension of the randomization-based approach to unbiased statistical learning proposed by Sande and Zhang (2021) and further extended in Zhang et al. (2025).
Another avenue for research in this context is the extension of the approach proposed in Zhang (2021) to replace costly and burdensome surveys with non-survey big-data sources. In particular, Zhang (2021) proposes using scanner data to compile the Consumer Price Index weights, with the household expenditure survey serving as an audit sample to assess the accuracy of the scanner data-based weights. Given the very large amount of data, the bias completely dominates the variance in scanner data, so a test for assessing the accuracy of these weights, together with a measure of their uncertainty, is proposed. Particularly valuable and worthy of further research is the proposed evaluation coverage, a novel accuracy measure which provides the means to weigh the big-data bias against the costs of alternative and possibly unbiased estimation methods.
4. Concluding Remarks
The modernization process invoked for Official Statistics gives new impetus for the use of ML methods. Here we have focused on their potential for finite population parameter estimation, imputation, and data integration. The flexibility afforded by these methods necessitates proper evaluation of the accuracy of the final outputs. Further research is needed to improve variance estimation for ML-enhanced model-assisted estimators, small area estimators, and imputed register values. Replication-based methods, as well as conformal prediction approaches, represent valuable alternatives to analytical methods.
The need to make full use of non-probability sources of data paves the way for the adoption of non-design-based estimation methods. In this context, assessing the presence of selection bias introduced by such data sources—and accordingly adjusting the final estimates—is of paramount importance to ensure valid inference. The approach proposed in Zhang (2021) opens new avenues for research in using audit samples to assess this bias. ML models can enhance multiply robust data integration methods and make the assumption of missing at random data more tenable. Nonetheless, more research is needed to handle more complex selection mechanisms. The randomization-based approach of statistical learning proposed by Sande and Zhang (2021) and Zhang et al. (2025) offers a distinct and promising perspective on debiasing these data sources.
Open-source software packages that offer a range of ML algorithms in a standardized framework—such as the caret package in R—facilitate the application of these tools across various areas of official statistics. However, fine-tuning these methodologies is of paramount importance, and the necessary expertise must therefore be developed or acquired within National Statistical Institutes (NSIs). van Delden et al. (2023) discuss key issues concerning the use of ML in a statistical context, including the processes of selecting an ML algorithm, training and testing models, and evaluating their performance. Moreover, running these algorithms on the large datasets typically found in NSIs can be highly computationally intensive. As cloud services may not be an option due to privacy concerns, increasing in-house computing capacity becomes essential.
Certainly, ML has already demonstrated its potential for official statistics. Nonetheless, compared with the traditional approach adopted by NSIs—based on probability sampling combined with design-based inference methods—ML entails a significantly higher degree of risk, particularly when applied to new data sources. This underscores the need to adjust the quality dimensions of existing quality assurance frameworks, such as the Quality Assurance Framework of the European Statistical System. Further research is needed in this direction, building on the proposals by De Broe et al. (2021), Puts and Daas (2021), and Puts et al. (2024), which aim to incorporate the increasing use of ML models into classical quality assessment frameworks.
Acknowledgements
I would like to dedicate this contribution to the memory of Piet Daas, who passed away while I was working on these pages. He made significant contributions to the effective use of Machine Learning in Official Statistics and to the discussion of its advantages and pitfalls. He will be deeply missed, as a statistician and as a person.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work has been developed under the support of European Union—Next Generation EU, Mission 4 Component 1 CUP J33C22002910001—Project INCLUDE-ALL: social INCLUsion and Digital indicators Estimation At Local Level.
Received: April 16, 2025
Accepted: May 18, 2025
