1. Introduction
When JOS was founded, inference for finite population parameters was mostly based on probability sample surveys and followed a design-based approach. Even with the advent of the model-assisted framework (Särndal et al. 1992), which opened the way to the use of (assisting) models in survey estimation, there remained a widespread perception that nothing beyond a linear regression model was truly necessary in this context. This view was particularly prevalent in National Statistical Institutes (NSIs), where, in many countries, the production process of official statistics relied primarily on data from sample surveys and weighting systems enhanced through calibration (Deville and Särndal 1992). Although the link between calibration and regression estimation was well recognized, the implicit assumption of a linear assisting model underlying the calibration adjustment was not generally considered an issue. Indeed, calibration estimation offered: (i) coherence with administrative data, population-level auxiliary information, and/or other published statistics; (ii) design-consistent estimates with improved efficiency relative to basic Horvitz-Thompson estimation; and (iii) some protection against nonresponse bias (Lundström and Särndal 1999; Särndal 2007). Moreover, the final set of weights requires auxiliary information only for the sample units, with population aggregates imported from external sources, and can be applied to any characteristic of interest, as it does not depend on the survey variables themselves.
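To make the calibration adjustment described above concrete, the sketch below solves the classical chi-square-distance calibration problem in closed form (the linear/GREG case underlying Deville and Särndal 1992): design weights are adjusted so that the weighted sample totals of the auxiliary variables match known population totals. The function name and toy data are illustrative only, not taken from any NSI production system.

```python
import numpy as np

def linear_calibration(d, X, totals):
    """Linear (GREG-type) calibration with chi-square distance.

    d      : design weights for the sample units (n,)
    X      : auxiliary variables for the sample units (n, p)
    totals : known population totals of the auxiliary variables (p,)

    Minimizing sum((w_i - d_i)^2 / d_i) subject to X'w = totals gives the
    closed form w = d * (1 + X @ lam), lam = (X'DX)^{-1} (totals - X'd).
    """
    lam = np.linalg.solve(X.T @ (d[:, None] * X), totals - X.T @ d)
    return d * (1.0 + X @ lam)

# Illustrative use: 10 sampled units, intercept plus one auxiliary variable.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.random(10)])
d = np.full(10, 5.0)                      # equal design weights
totals = np.array([50.0, 26.0])           # hypothetical population benchmarks
w = linear_calibration(d, X, totals)      # X.T @ w now reproduces `totals`
```

Because the calibrated weights do not depend on the survey variables, the same `w` can be applied to any characteristic of interest, as noted in the text.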
Nonetheless, such a low-risk framework, in which the number of choices to be made is relatively limited, is challenged by declining response rates, the need to reduce administration costs and response burden, and the demand for more timely statistical information at a higher frequency and a more detailed level. Therefore, over the past decade, a paradigm shift has been advocated for NSIs, moving toward a framework based on the integration of multiple data sources, such as probability and nonprobability surveys, administrative data, and big and other new data sources. This modernization process requires a more intensive use of modeling tools and implies an increased level of risk arising from an increased number of (modeling) choices. In this setting, interest in machine learning (ML) methods has grown rapidly, and international initiatives on ML for official statistics have been launched, such as the UNECE High-Level Group on Modernisation of Official Statistics ML Project and, subsequently, the United Kingdom’s Office for National Statistics—UNECE ML Group (all project materials, such as reports, code, data, presentations, and papers, are available on the UNECE Wiki: UNECE 2023). Discussions on the use of ML in official statistics have also focused on data processing and on classification and coding of textual data (e.g., Measure 2023). The focus here is on its role and potential for the estimation of population parameters, including for subpopulations, and in the presence of new data sources which may introduce selection bias.
2. Estimation of Finite Population Parameters
To estimate population parameters, prior research extending calibration and regression estimation to non-linear models is particularly useful. The contribution of Breidt and Opsomer (2000), which introduces scatterplot smoothing via local polynomials in regression estimation, may be regarded as seminal in this regard. The model calibration framework proposed by Wu and Sitter (2001), which replaces benchmark constraints on the population totals of auxiliary variables with those on the totals of predicted values derived from parametric nonlinear or generalized linear models, also enabled developments toward greater model flexibility in calibration estimation.
A general framework for deriving model-assisted estimators using data from complex surveys in conjunction with auxiliary information is presented by Breidt and Opsomer (2017). This framework accommodates a wide range of predictive methods, including linear and nonlinear parametric models as well as nonparametric and ML techniques. It supports both regression and model calibration estimation strategies. Examples of nonparametric regression and statistical learning models already implemented within this framework include generalized additive mixed models using penalized splines, neural networks, projection pursuit, k-nearest neighbors, shrinkage methods such as the (adaptive) lasso, regression trees, and ensemble methods such as model averaging, random forests, and bagging (Breidt and Opsomer 2017; Dagdoug et al. 2023; Goga 2024). These approaches, although still design-based, are typically highly variable-specific and, in order to be computed, often require complete auxiliary information, that is, auxiliary variables known for every unit in the population. Consequently, their use in large-scale, multipurpose surveys has remained limited, with notable exceptions including environmental and forest inventory surveys (e.g., Baffetta et al. 2009; Opsomer et al. 2007). However, following the modernization call, many NSIs, particularly in Europe, are working to create population register systems by integrating data from diverse sources, and this has the potential to provide rich frames with unit-level auxiliary information to be used effectively with ML methods. In the area of estimation of population parameters, further research is needed on variance estimation for the final estimators, as the analytic approach currently available, based on sample residuals, can underestimate the true variance if the ML model is not trained properly and overfits the data.
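As a minimal illustration of this model-assisted framework, the sketch below computes the generalized difference estimator of a population total, which sums the ML predictions over the population and adds the design-weighted sample residuals. A k-nearest-neighbour predictor, one of the methods listed above, is implemented with NumPy only so the example stays self-contained; all names and data are illustrative.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    """k-nearest-neighbour regression for a one-dimensional covariate."""
    dist = np.abs(x_new[:, None] - x_train[None, :])   # pairwise distances
    idx = np.argsort(dist, axis=1)[:, :k]              # k closest training units
    return y_train[idx].mean(axis=1)

def model_assisted_total(y_s, m_s, pi_s, m_U):
    """Generalized difference (model-assisted) estimator of a total:
    population sum of predictions m_U plus the design-weighted sum of
    sample residuals (y_s - m_s) / pi_s. Design-consistent regardless of
    whether the working ML model is correctly specified."""
    return m_U.sum() + ((y_s - m_s) / pi_s).sum()

# Illustrative use: with the null predictor m == 0, the estimator reduces
# to the basic Horvitz-Thompson estimator, as it should.
y_s = np.array([2.0, 4.0])
pi_s = np.array([0.5, 0.5])
ht_equivalent = model_assisted_total(y_s, np.zeros(2), pi_s, np.zeros(10))
```

Note that computing `m_U` requires the auxiliary variable for every population unit, which is precisely the complete-auxiliary-information requirement discussed above.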
Small area estimation can improve the level of detail, frequency, and timeliness of official statistics without increasing sample sizes and thus data collection costs, and ML methods are natural candidates for this essentially model-based approach. Tzavidis (2025) provides a discussion of the opportunities and challenges of ML for small area estimation in official statistics, while Molina (2024) offers an overview of ML methods for small area estimation with a particular focus on wealth indicators. Indeed, there is a large literature on ML algorithms for predicting poverty using satellite or aerial images, particularly in countries with limited availability of administrative data and/or large-scale sample surveys. In these applications, ML algorithms are applied to establish the relation between survey data and mobile phone data or sensor data, such as night-time light intensity from satellite images; predictions of poverty at finer regional levels are then obtained by applying the algorithm to the sensor data, which are assumed to cover the population of interest. See van den Brakel (2022, 2025) and Hall et al. (2023) for recent reviews. All these methods are synthetic, in the sense that they do not include area effects and therefore assume that the available covariates, paired with flexible ML methods, explain all the between-area heterogeneity. The only exception is Krennmair and Schmid (2022), where mixed effects random forests are used to combine the advantages of regression forests with the ability to model hierarchical/clustered data. An alternative approach that works within an ML perspective is proposed in Parker (2024), where random weight neural networks are used to extend the Fay-Herriot model. In a sense, research has moved along two different paths that could benefit from greater cross-fertilization.
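The synthetic character of these estimators can be made explicit with a short sketch: once unit-level ML predictions are available for the whole population, the synthetic estimate for each area is simply the within-area average of the predictions, with no area effect term. Function and variable names are illustrative.

```python
import numpy as np

def synthetic_area_means(m_hat_U, area_U):
    """Synthetic small area estimates: average the unit-level ML
    predictions m_hat_U within each area code in area_U. No random area
    effects are included, so the covariates (through the ML model) are
    implicitly assumed to explain all between-area heterogeneity."""
    return {a: m_hat_U[area_U == a].mean() for a in np.unique(area_U)}

# Illustrative use with four population units in two areas.
m_hat_U = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical ML predictions
area_U = np.array(["a", "a", "b", "b"])    # area membership of each unit
estimates = synthetic_area_means(m_hat_U, area_U)
```

Approaches such as mixed effects random forests (Krennmair and Schmid 2022) relax exactly this assumption by adding an area-level random effect to the prediction.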
As with general estimation, further research is needed in small area estimation to obtain reliable estimates of the MSE of final estimates obtained using ML models.
3. New Sources of Data: Accuracy and Selection Bias
The modernization process, with the use of register data and of new sources of data, requires extensive use of predictive models for imputing missing data, making it a natural field for ML methods that can handle high-dimensional, unstructured, and complex data sources. This renewed interest in mass imputation over weighting techniques raises several methodological and ethical issues, including concerns related to model transparency and the communication of prediction uncertainty. The latter, in particular, opens up new avenues for methodological research. In fact, when a target variable is predicted by an ML model for at least some units in the population, the final user should be made aware that these predicted values cannot be treated as error-free observations, tempting as this may be. In this setting, the prediction error of each predicted value should be evaluated and incorporated into the final statistics. Attempts in this direction for parametric models can be found in Alleva et al. (2021) and in Deliu et al. (2025), and could be extended to ML methods. Alternative approaches, more suited to ML, may look at replication methods.
An interesting direction to explore is conformal prediction, an assumption-lean approach to generating distribution-free prediction intervals or sets for nearly arbitrary predictive models, with guaranteed finite-sample coverage. Conformal methods are an active research topic in statistics and ML, but only recently have they been considered for non-exchangeable data and, therefore, brought to bear on design-based inference for a finite population (Wieczorek 2023). Deliu and Liseo (2025) examine the relevance and applicability of conformal prediction for register-based official statistics and provide a starting point for research in the field to be enhanced with ML models. Bersson and Hoff (2024) develop an area-level prediction region procedure for small area estimation using a conformal prediction approach, a seminal contribution that can be enhanced using ML models in future research.
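To illustrate the basic mechanism, the sketch below implements split conformal prediction under the standard exchangeability assumption: absolute residuals on a held-out calibration set yield a finite-sample (1 − α) prediction interval around any new prediction, whatever the underlying model. The design-based, non-exchangeable extensions cited above require modified (weighted) versions of this idea; names and data here are illustrative.

```python
import numpy as np

def split_conformal_interval(resid_cal, y_hat_new, alpha=0.1):
    """Split conformal prediction interval.

    resid_cal : residuals (y - y_hat) of any fitted model on a held-out
                calibration set, assumed exchangeable with new data
    y_hat_new : model predictions for the new units
    alpha     : miscoverage level (0.1 gives nominal 90% intervals)
    """
    n = len(resid_cal)
    # Finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(np.abs(resid_cal), q_level, method="higher")
    return y_hat_new - q, y_hat_new + q

# Illustrative use: 10 calibration residuals, one new prediction.
lo, hi = split_conformal_interval(np.arange(1.0, 11.0), np.array([5.0]))
```

The guarantee is marginal coverage of at least 1 − α over the randomness in both the calibration set and the new unit, which is what makes the approach attractive for communicating prediction uncertainty in register-based outputs.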
A relevant field of application of ML tools at NSIs has been web-scraped data. Barcaroli et al. (2015) use these data together with text mining to produce experimental statistics on the use of ICT and the Internet by Italian enterprises. Similarly, Daas and van der Doef (2020) use web-scraped data to predict the number of innovative companies in the Netherlands: sample data obtained from the Dutch Community Innovation Survey are used to annotate web-scraped data from business websites. This annotated data set is used to train an ML algorithm, which in turn is used to predict the total number of innovative businesses in the Netherlands.
The use of these new sources of data, including non-probability convenience samples, sensor or mobile phone data, and satellite images, may introduce the issue of selection bias. In particular, when the variables of interest are not measured on a probability sample, several approaches have been proposed in the literature to adjust for selection bias. Many are closely related to methods used to adjust for nonresponse in probability samples. These approaches consider the case in which a high-quality probability sample or a census can be used to adjust for the selection bias, because it measures a set of auxiliary variables in common with the biased source (see e.g., Valliant 2024; Wu 2022). In this setting, one can (i) estimate the probability of being in the biased source of data as a function of these auxiliary variables and, then, use an inverse probability weighting approach, or (ii) learn a model that links the study variables with the auxiliary variables on the biased source and then take a mass-imputation approach by predicting the variables of interest in the high-quality probability sample/census, or (iii) combine approaches (i) and (ii) in a doubly- or multiply-robust approach in the spirit of Chen and Haziza (2017). Including random area effects, a doubly robust approach has also been used to integrate probability and nonprobability data in order to adjust for selection bias and obtain small area estimates (see e.g., Schirripa Spagnolo et al. 2025). All these approaches can greatly benefit from the use of ML methods, as already shown in some studies (Castro-Martín et al. 2021, 2022; Ferri-García et al. 2021, 2024; Rueda et al. 2023).
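Approaches (i)-(iii) can be sketched in a few lines. The function below computes a doubly robust population mean in the spirit of Chen and Haziza (2017), combining inverse propensity weighting on the nonprobability sample B with mass imputation on the reference probability sample A; it assumes the propensities and outcome predictions have already been fitted (possibly with ML methods), and all names are illustrative.

```python
import numpy as np

def dr_mean(y_B, m_B, p_B, m_A, d_A):
    """Doubly robust estimator of a population mean.

    y_B : study variable observed on the nonprobability sample B
    m_B : outcome-model predictions for the units in B
    p_B : estimated propensities of inclusion in B for those units
    m_A : outcome-model predictions for the probability sample A
    d_A : design weights of the probability sample A

    Consistent if either the propensity model (approach i) or the outcome
    model (approach ii, mass imputation) is correctly specified.
    """
    N_hat = d_A.sum()                         # estimated population size
    ipw_term = ((y_B - m_B) / p_B).sum()      # propensity-weighted residuals
    imp_term = (d_A * m_A).sum()              # mass-imputation term on A
    return (ipw_term + imp_term) / N_hat

# Illustrative check: with a perfect outcome model on B (m_B == y_B),
# the estimator reduces to the weighted mean of the predictions on A.
mu_hat = dr_mean(np.array([1.0, 2.0]), np.array([1.0, 2.0]),
                 np.array([0.5, 0.5]), np.array([3.0, 4.0]),
                 np.array([2.0, 2.0]))
```

Replacing the parametric propensity and outcome models with ML learners is exactly where the studies cited above report efficiency gains.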
A different and promising perspective is taken in Lee et al. (2023), where a class of robust estimators is proposed by using multiple ML techniques as models for the outcome variable trained on a subsample of the respondents, computing their prediction errors on the hold-out subsample, and projecting these errors to the non-respondents under a cell mean response probability model. The final proposed estimator is a weighted average of these multiple estimators, with weights obtained via a subsampling Rao-Blackwell method, using a robust extension of the randomization-based approach to unbiased statistical learning proposed by Sande and Zhang (2021) and further extended in Zhang et al. (2025).
Another avenue for research in this context is the extension of the approach proposed in Zhang (2021) to replace costly and burdensome surveys with non-survey big-data sources. In particular, Zhang (2021) proposes using scanner data to compile the Consumer Price Index weights, with the household expenditure survey serving as an audit sample to assess the accuracy of the scanner data-based weights. Given the very large amount of data, the bias completely dominates the variance in scanner data, so a test for assessing the accuracy of these weights, together with a measure of their uncertainty, is proposed. Particularly valuable and worthy of further research is the proposed evaluation coverage, a novel accuracy measure which provides the means to weigh the big-data bias against the costs of alternative and possibly unbiased estimation methods.
4. Concluding Remarks
The modernization process invoked for Official Statistics gives new impetus for the use of ML methods. Here we have focused on their potential for finite population parameter estimation, imputation, and data integration. The flexibility afforded by these methods necessitates proper evaluation of the accuracy of the final outputs. Further research is needed to improve variance estimation for ML-enhanced model-assisted estimators, small area estimators, and imputed register values. Replication-based methods, as well as conformal prediction approaches, represent valuable alternatives to analytical methods.
The need to make full use of non-probability sources of data paves the way for the adoption of non-design-based estimation methods. In this context, assessing the presence of selection bias introduced by such data sources—and accordingly adjusting the final estimates—is of paramount importance to ensure valid inference. The approach proposed in Zhang (2021) opens new avenues for research in using audit samples to assess this bias. ML models can enhance multiply robust data integration methods and make the assumption of missing at random data more tenable. Nonetheless, more research is needed to handle more complex selection mechanisms. The randomization-based approach of statistical learning proposed by Sande and Zhang (2021) and Zhang et al. (2025) offers a distinct and promising perspective on debiasing these data sources.
Open-source software packages that offer a range of ML algorithms in a standardized framework—such as the caret package in R—facilitate the application of these tools across various areas of official statistics. However, fine-tuning these methodologies is of paramount importance, and the necessary expertise must therefore be developed or acquired within National Statistical Institutes (NSIs). van Delden et al. (2023) discuss key issues concerning the use of ML in a statistical context, including the processes of selecting an ML algorithm, training and testing models, and evaluating their performance. Moreover, running these algorithms on the large datasets typically found in NSIs can be highly computationally intensive. As cloud services may not be an option due to privacy concerns, increasing in-house computing capacity becomes essential.
Certainly, ML has already demonstrated its potential for official statistics. Nonetheless, compared with the traditional approach adopted by NSIs—based on probability sampling combined with design-based inference methods—ML entails a significantly higher degree of risk, particularly when applied to new data sources. This underscores the need to adjust the quality dimensions of existing quality assurance frameworks, such as the Quality Assurance Framework of the European Statistical System. Further research is needed in this direction, building on the proposals by De Broe et al. (2021), Puts and Daas (2021), and Puts et al. (2024), which aim to incorporate the increasing use of ML models into classical quality assessment frameworks.
Acknowledgements
I would like to dedicate this contribution to the memory of Piet Daas, who passed away while I was working on these pages. He made significant contributions to the effective use of Machine Learning in Official Statistics and to the discussion of its advantages and pitfalls. He will be deeply missed, as a statistician and as a person.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work has been developed under the support of European Union—Next Generation EU, Mission 4 Component 1 CUP J33C22002910001—Project INCLUDE-ALL: social INCLUsion and Digital indicators Estimation At Local Level.
Received: April 16, 2025
Accepted: May 18, 2025
