Abstract
Machine Learning (ML) models often achieve high accuracy but fail to meet the reliability and robustness standards required for Official Statistics (OS). Neural networks, in particular, function as black-box predictors prone to overconfidence, offering no direct method to measure the true uncertainty in predictions and in estimates of population parameters. Non-rigorous approaches include treating ML predictions as gold standard data and treating heuristic notions of uncertainty, such as softmax scores in classification problems, as valid measures of confidence; both can easily lead to unreliable uncertainty quantification. This paper addresses two distinct problems: (1) quantifying prediction-level uncertainty for new observations and (2) quantifying noise-free uncertainty for estimates of population parameters. We propose handling the former via conformal prediction (CP) and the latter via prediction-powered inference (PPI), both model-agnostic statistical frameworks for uncertainty quantification. Finally, we present real-world OS use cases applying both techniques.
Introduction
The production of Official Statistics is increasingly reliant on machine learning,1,2 especially to handle data imputation and non-traditional data sources. However, while flexible and often accurate, Machine Learning (ML) models, and especially Deep Neural Networks (DNNs), often fail to meet the statistical requirements necessary for rigorous prediction and inference: Uncertainty Quantification (UQ) in model predictions and valid inference on population parameters. A well-documented issue with DNNs is their tendency to produce overconfident probability estimates, especially on out-of-distribution (OOD) data.3 This is problematic, as DNNs can be just as confident in incorrect predictions as in correct ones. As a result, if we rely solely on heuristic notions of uncertainty, such as softmax scores in classification problems, which approximate a probability distribution over classes but do not quantify the true uncertainty,4 we risk severely misestimating it.
In Official Statistics (OS), two main types of uncertainty are of particular interest: (1) prediction uncertainty, i.e., the uncertainty in the model’s prediction for a previously unseen data point, and, perhaps more importantly, (2) inference uncertainty, which pertains to the estimation of a population parameter from sample data. We propose tackling prediction uncertainty using Conformal Prediction (CP) techniques5 and parameter uncertainty via Prediction-powered Inference (PPI).6 Both utilize predictions from a pre-trained model applied to labeled and unlabeled datasets.
CP is a model-agnostic, distribution-free framework for quantifying prediction uncertainty. It relies on labeled data to construct uncertainty measures, yielding prediction sets for classification tasks and prediction intervals for regression tasks. PPI likewise makes minimal distributional assumptions, but uses model predictions to produce statistically valid, noise-free confidence intervals for population parameters such as means, quantiles, and regression coefficients. Recent work has sought to derive parameter confidence intervals from CP’s prediction intervals (e.g., conformal confidence regions7), which appear valid for finite samples without strong noise assumptions. Here, however, we rely on PPI, given its established methodology and because its noise assumptions are likely to hold for most datasets.
Related work
The literature on integrating machine learning into Official Statistics has grown substantially in recent years. In their manifesto, Puts, Salgado, and Daas8 outline a broad vision for integrating ML in OS, focusing on opportunities and challenges that need to be addressed; specifically, they call for a balanced approach that preserves the statistical rigor necessary for policy-relevant outputs. Dumpert et al.9 propose a comprehensive quality concept for using ML in OS. They extend traditional quality frameworks by incorporating dimensions such as accuracy, reproducibility, explainability, timeliness, and cost-effectiveness, focusing on operational contexts and providing guidelines for the systematic evaluation and monitoring of ML algorithms in statistical production processes. In a complementary vein, Molladavoudi and Yung10 focus on the quality dimension of trustworthy machine learning. Their work addresses model explainability and uncertainty quantification, offering practical insights for embedding these dimensions within existing quality assurance frameworks at national statistical offices. Van Delden, Burger, and Puts11 articulate ten propositions on the role of ML in OS, aiming to provide strategic recommendations on methodological considerations and operational challenges. Additionally, Nunes and Ashofteh12 review the operational dimensions of big data and machine learning operations, with a particular focus on the adoption of MLOps frameworks. Finally, Breidt and Opsomer13 demonstrate how traditional model-assisted survey estimation can be enhanced with modern prediction techniques, showing how variance reduction and improved efficiency can be achieved by integrating machine learning predictions into established statistical estimators.
A broad literature on uncertainty quantification (UQ) complements recent work on conformal prediction (CP) and prediction-powered inference (PPI). Classical UQ methods include Bayesian inference, via Bayesian neural networks (BNNs) and full posterior or MCMC-based predictive distributions;14–16 resampling and simulation techniques such as the bootstrap17,18 and parametric Monte Carlo;19 and ensemble or committee methods including bagging,20 random forests,21 deep ensembles,22 and MC-dropout.23 These approaches provide rich representations of epistemic and aleatoric uncertainty and are effective when model structure and priors are credible, but their validity typically depends on correct model specification, prior assumptions, or large-sample asymptotics. In contrast, CP yields model-agnostic, distribution-free, finite-sample marginal coverage under exchangeability,5,24 making it robust to misspecification and out-of-distribution inputs, though its guarantees are marginal rather than conditional and can be conservative under heteroskedasticity.25 PPI uses predictions to improve efficiency in estimating population parameters while retaining the frequentist validity of confidence intervals under mild regularity conditions,6 in contrast to fully Bayesian posterior-based inference.
Uncertainty quantification methods differ substantially in their computational costs. Full Bayesian approaches, such as BNNs with MCMC integration, offer detailed posterior uncertainty but are computationally intensive and difficult to scale to large models due to costly sampling and integration steps. Approximate Bayesian techniques such as MC dropout and deep ensembles reduce this cost but still require multiple forward passes or model training procedures, increasing inference time and memory requirements with the number of samples or ensemble members. In contrast, CP is model-agnostic and computationally lightweight: once a predictor is trained, CP primarily involves calibration on a held-out set rather than retraining or sampling, yielding valid coverage under mild exchangeability assumptions with minimal overhead. PPI similarly builds on fixed model predictions to produce confidence intervals for population parameters, with the main cost tied to aggregating predictions across labeled and unlabeled data rather than sampling or hyperparameter tuning. In general, Bayesian and ensemble-based UQ methods aim to model uncertainty and can yield detailed estimates when their assumptions are well justified, whereas CP and PPI aim to guarantee validity with minimal reliance on correct model specification, which is crucial given the ever-increasing use of black-box large models to analyze unstructured data, especially from textual sources.
Methodology
Here, we briefly introduce the concept of Uncertainty Quantification (UQ), then define the main characteristics and properties of CP and PPI and illustrate their algorithms.
The premise of uncertainty quantification
UQ is a broadly explored field in predictive inference and machine learning.26,27 Its high-level goal is to assess, represent, and mitigate uncertainty in prediction and inference, so as to achieve robust, trustworthy results and improve model interpretability. At its core, the general premise of UQ is to systematically characterize and quantify the different sources of uncertainty, ensuring that both aleatoric (inherent data randomness, irreducible by nature) and epistemic (model-related, reducible with better modelling) uncertainties are accounted for, following Der Kiureghian and Ditlevsen.28
To illustrate this in Bayesian terms, we can write the posterior predictive distribution as
$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta,$$
where the likelihood $p(y \mid x, \theta)$ captures aleatoric uncertainty and the posterior $p(\theta \mid \mathcal{D})$ over the model parameters captures epistemic uncertainty.
Softmax “probabilities”. In classification tasks, DNNs typically use a projection (classification) layer that maps a vector of logits $z \in \mathbb{R}^K$ to scores $\mathrm{softmax}(z)_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$. These scores sum to one and thus resemble a probability distribution over the $K$ classes, but they are generally miscalibrated and should not be interpreted as valid measures of confidence.
Prediction vs confidence intervals. Prediction intervals (PIs) and confidence intervals (CIs) capture conceptually different types of uncertainty. A PI is designed to represent the range within which a future observation will fall with a specified probability, whereas a CI quantifies the uncertainty around the estimate of a fixed population parameter, such as a mean or a regression coefficient.
Conformal prediction
Conformal prediction is a statistical framework for UQ that derives prediction sets or intervals from a model’s predictions. Its appeal lies in its model-agnostic nature and its ability to provide finite-sample, distribution-free coverage guarantees under the mild assumption of data exchangeability. It uses a labeled calibration set and a notion of conformity, quantified through nonconformity scores (heuristic notions of uncertainty), to assess how well a new observation “conforms” to the observed data. Exchangeability must hold between the calibration data and new data points: when it fails, standard conformal methods no longer reliably achieve the nominal coverage level, and the distribution-free guarantees break because the calibration ranking no longer reflects the distribution of new test points. Detailed theoretical foundations are discussed in Angelopoulos et al.29 From now on, we focus on classification tasks (prediction sets); while notation can vary, the conclusions remain valid for regression problems (prediction intervals).
Coverage and adaptivity. The fundamental property that must be satisfied by any framework aiming to measure prediction uncertainty is coverage, i.e., the statistical guarantee that the true value is contained in the prediction set with a prescribed probability. Formally, a procedure achieves marginal coverage at level $1 - \alpha$ if $\mathbb{P}\left(Y_{n+1} \in \mathcal{C}(X_{n+1})\right) \geq 1 - \alpha$ for a new test point $(X_{n+1}, Y_{n+1})$.
Achieving marginal coverage is necessary but not sufficient for a procedure to be useful: a technique could provide coverage guarantees while producing prediction sets or intervals so wide that they become irrelevant for the analysis. Adaptivity is the ability to adjust to local variability in the data: we expect smaller sets for easy-to-classify examples and larger sets for difficult-to-classify ones. In general, adaptive procedures lead to larger but more informative prediction sets. Several diagnostics can be used to evaluate adaptivity, including label-stratified coverage (LSC), which evaluates coverage separately across all classes, as in the sketch below.
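As an illustration, a minimal LSC check could look as follows; the function and array names are ours, assuming prediction sets are stored as a boolean membership mask.

```python
import numpy as np

def label_stratified_coverage(pred_sets, labels):
    """Empirical coverage computed separately for each class.

    pred_sets: (m, K) boolean mask, True where a class is in the prediction set
    labels:    (m,)   integer true labels
    Returns a dict mapping each class to its empirical coverage.
    """
    covered = pred_sets[np.arange(len(labels)), labels]
    return {int(k): covered[labels == k].mean() for k in np.unique(labels)}
```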
Basic algorithm. The most common implementation of CP is the split conformal method.25 A pre-trained model $f$ is applied to a held-out labeled calibration set $\{(x_i, y_i)\}_{i=1}^{n}$ to compute nonconformity scores $s_i = s(x_i, y_i)$; in classification, a common choice is $s_i = 1 - f(x_i)_{y_i}$, one minus the softmax score of the true class. Given a nominal level $1 - \alpha$, we take $\hat{q}$ as the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the scores, and for a new input $x_{n+1}$ we form the prediction set $\mathcal{C}(x_{n+1}) = \{y : s(x_{n+1}, y) \leq \hat{q}\}$.
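A minimal sketch of this procedure in Python, assuming softmax scores as the heuristic notion of uncertainty (the function name and array layout are ours):

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets for classification.

    cal_probs:  (n, K) softmax scores on the calibration set
    cal_labels: (n,)   true labels of the calibration set
    test_probs: (m, K) softmax scores on new inputs
    Returns an (m, K) boolean mask: True where a class enters the set.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the softmax score of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample (n + 1) correction.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(scores, q_level, method="higher")
    # Include every class whose nonconformity score is within the quantile.
    return (1.0 - test_probs) <= q_hat
```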
Prediction-powered inference
Prediction-powered inference (PPI) combines machine-learning predictions with a smaller set of trusted, “gold standard” labeled observations to perform statistically valid inference on population parameters. It corrects the bias of imputation estimators by estimating a rectifier, which quantifies the error between model predictions and gold standard outcomes. By constructing a confidence interval around the rectifier, PPI adjusts the imputed estimator to produce tighter, valid prediction-powered CIs. Additionally, PPI supports hypothesis testing.
Algorithm. The first step in the PPI procedure is to identify a problem-specific rectifier that measures the bias of the imputed estimator. In the case of mean estimation, the rectifier is the expected prediction error, estimated on the $n$ gold standard observations as
$$\hat{\Delta} = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right), \quad (1)$$
so that the prediction-powered estimator subtracts $\hat{\Delta}$ from the naive imputed mean computed over the $N$ unlabeled observations.
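A minimal sketch of PPI mean estimation with a CLT-based interval, under the setup above (function and variable names are ours):

```python
import numpy as np
from scipy import stats

def ppi_mean_ci(y_labeled, preds_labeled, preds_unlabeled, alpha=0.05):
    """Prediction-powered confidence interval for a population mean.

    y_labeled:       (n,) gold standard outcomes
    preds_labeled:   (n,) model predictions on the labeled data
    preds_unlabeled: (N,) model predictions on the unlabeled data
    """
    n, N = len(y_labeled), len(preds_unlabeled)
    rectifier = preds_labeled - y_labeled          # empirical prediction errors
    theta_pp = preds_unlabeled.mean() - rectifier.mean()
    # Standard error combines prediction variability and rectifier variability.
    se = np.sqrt(preds_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta_pp - z * se, theta_pp + z * se
```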
Use cases
In this section, we show two Official Statistics use cases involving non-traditional data sources, to which we apply CP and PPI techniques: (a) predicting arrival ports from ship trajectories and (b) estimating the proportion of hate speech on social media.
Predicting arrival ports – conformal prediction
The Italian Institute of Statistics (Istat) is training deep learning models to predict arrival ports from ship trajectories of Automatic Identification System (AIS) signals, in order to integrate survey-based official maritime statistics.31
Formally, the problem consists of training a model $f$ that maps a ship trajectory $x$, i.e., a sequence of AIS signals, to its arrival port $y$, chosen among a finite set of candidate ports.
Methodology. We train an attention-based bidirectional long short-term memory (AT-BiLSTM) network32,33 that receives ship trajectories as input and outputs a softmax score distribution over the label space of arrival ports. In other words, it assigns a score to each arrival port for every trajectory. The model is trained on 42,339 labeled trajectories from the first quarter of 2022, calibrated on 4,137 labeled trajectories, and tested on 4,597 labeled trajectories.
Model overconfidence. To illustrate the problem of overconfidence, we can analyze the distribution of softmax scores assigned to the true class in incorrect predictions. Looking at Figure 1(right), we notice how the distribution is severely skewed towards very low values. Indeed, half of the mass lies below 0.18. This indicates that the model tends to make mistakes very confidently. Figure 1(left) also shows an individual example of this. If we used these scores as a statistically valid measure of uncertainty, we would severely underestimate it.

Figure 1. Example of an overconfident incorrect prediction (left) and the distribution of true-class softmax scores over incorrect predictions (right).
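This diagnostic is straightforward to reproduce; a sketch under our assumed array names (test_probs holding softmax scores, test_labels the true ports, both hypothetical):

```python
import numpy as np

# Select misclassified test examples.
wrong = test_probs.argmax(axis=1) != test_labels
# Softmax score the model assigned to the *true* class on those mistakes.
true_class_scores = test_probs[wrong, test_labels[wrong]]
# A low median signals that the model makes mistakes confidently.
print(np.median(true_class_scores))
```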
Conformal prediction. A more rigorous approach is to apply split conformal prediction. To do this, we predict the labels for the calibration set, compute the nonconformity scores from the softmax outputs, and use their empirical quantile to build prediction sets for the test trajectories.
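Using the split conformal sketch from the methodology, this amounts to the following (array names hypothetical):

```python
import numpy as np

# cal_probs, cal_labels: softmax scores and true ports on the calibration set
# test_probs, test_labels: the same quantities on the test set
pred_sets = split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1)
coverage = pred_sets[np.arange(len(test_labels)), test_labels].mean()
avg_size = pred_sets.sum(axis=1).mean()  # average prediction-set size
```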

Figure 2. Empirical coverage distribution on the test set.
To evaluate the adaptivity of our procedure, we can extract label-specific coverages; that is, we compute the coverage obtained by the model on each of the possible arrival ports.

Figure 3. Results of the label-stratified coverage (LSC) check. Each bar represents a label (arrival port); grey bars indicate coverage greater than or equal to the nominal level.
Estimating the proportion of hate speech – prediction-powered inference
Here, the goal is to estimate the proportion of hate speech on X (formerly Twitter) over the total number of Italian Tweets related to migration and ethnic minorities.34 Given a vast corpus of unlabeled Tweets and a much smaller set of gold standard annotated ones, we seek a statistically valid confidence interval for this proportion.
Methodology. To build a hate speech classifier, we fine-tune a multilingual robustly optimized BERT (XLM-R) model,36 specifically the large version with 561 million parameters (the code used for fine-tuning can be found at https://github.com/istat-methodology/fine-tuning-pipelines). We split the labeled data into a set used for fine-tuning and a gold standard set reserved for inference.
Prediction-powered inference. We can use PPI to create a confidence interval around the estimated proportion of hateful speech. First, we define the prediction-powered estimator as
$$\hat{\pi}^{\mathrm{PP}} = \frac{1}{N} \sum_{j=1}^{N} f(\tilde{x}_j) - \hat{\Delta}, \quad (2)$$
i.e., the naive imputed proportion over the $N$ unlabeled Tweets corrected by the empirical rectifier $\hat{\Delta}$ of (1). A CLT-based confidence interval then follows by combining the variability of the predictions over the unlabeled data with that of the rectifier over the $n$ gold standard Tweets.
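Since a proportion is the mean of a binary variable, the generic PPI sketch from the methodology applies directly (variable names hypothetical):

```python
# y_labeled in {0, 1}: gold standard hate-speech annotations
# preds_labeled, preds_unlabeled: the classifier's 0/1 predictions
lo, hi = ppi_mean_ci(y_labeled, preds_labeled, preds_unlabeled, alpha=0.05)
```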

Figure 4. Prediction-powered confidence interval for the 2018 daily series of the proportion of hate speech on X.
Limitations. Recalling (2), when building the confidence interval for the time series, we are essentially using the same interval width over the whole series, since the width depends mostly on the variability of the empirical rectifier: when the number of unlabeled observations $N$ is much larger than the number of labeled ones $n$, the contribution of the predictions’ variance becomes negligible.
Conclusions
In this study, we have shown that rigorous uncertainty quantification in machine learning, essential for official statistical production, can be achieved by combining model-agnostic frameworks with high-throughput prediction systems. Our work presents two complementary methodologies: conformal prediction (CP) for quantifying prediction-level uncertainty, and prediction-powered inference (PPI) for obtaining statistically valid, noise-free confidence intervals for population parameters. CP allows us to construct prediction sets and intervals that adapt to local data variability while providing finite-sample coverage guarantees, ensuring that even out-of-distribution data are accompanied by reliable uncertainty estimates. In parallel, PPI exploits both the vast amounts of unlabeled data and a much smaller set of gold standard observations to correct biases inherent in naïve imputation approaches, resulting in more precise inference for population parameters of interest.
Because the dominant component of the computational budget lies in fitting the underlying predictive model, while CP and PPI introduce only marginal additional overhead, these methods are, at least in principle, operationally feasible for large-scale deployment within Official Statistics. In particular, once a model has been trained, CP and PPI require only lightweight post-processing steps (such as nonparametric calibration or residual-based corrections), whose computational complexity grows linearly with the calibration sample and is negligible relative to model training. This computational profile is especially advantageous in production environments where indicators must be generated repeatedly or at high frequency, since a single trained model can be reused to deliver uncertainty assessments across multiple outputs without materially increasing runtime or resource consumption.
The real-world use cases we explored – predicting arrival ports in maritime statistics and estimating the prevalence of hate speech on social media – illustrate how these frameworks can be integrated into Official Statistics. By providing statistically rigorous uncertainty quantification, these methods can increase the interpretability and trustworthiness of machine learning outputs, as well as provide guidance and additional diagnostics during model development.
CP and PPI naturally have limitations: CP relies on the assumption of data exchangeability (already tackled by recent variants37), and both require careful calibration as well as additional data. Nevertheless, they prove especially flexible and scalable compared with most uncertainty quantification techniques.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
