Abstract
Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under or overconfident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, to avoid overconfident subsystems dominating the output. Such awareness also underpins assurance in safety or business-critical contexts and builds user trust in models. This article provides a comprehensive review of probability calibration metrics for classifier models, organizing them according to multiple groupings to highlight their relationships. We identify 94 metrics, and group them into four main families: point-based, bin-based, kernel or curve-based, and cumulative. For each metric, we catalog properties of interest and provide equations in a unified notation, facilitating implementation and comparison by future researchers. Finally, we provide recommendations for which metrics should be used in different situations.
1. Introduction
Artificial intelligence (AI) and machine learning (ML) models have seen widespread adoption in recent years. When such models are used in safety or business-critical applications, it is vital to be able to understand and assure their behavior. Models generate predictions accompanied by confidence scores or probabilities, which are considered calibrated when they accurately reflect the proportion of correct classification decisions. However, confidence scores are not always representative of true probabilities. Assessing model calibration, particularly under any operational conditions that differ from training, requires a robust measure of calibration quality.
The publication of Guo et al. (2017), which highlighted examples of miscalibration in deep neural networks, sparked intense interest in the concept of calibration and metrics to measure it. This surge in attention has led to numerous publications over the last few years, with approximately ten new metrics being defined and proposed every year since then, and many papers discussing their merits or application. However, as Flach and Song (2020) witheringly put it in their conclusion, “contrary what recent machine learning literature may lead you to believe, calibration research predates machine learning and has been studied for three-quarters of a century.” Thus, there is a large body of work from which to draw. Silva Filho et al. (2023) were motivated to write a survey on assessing and improving classifier calibration as “the literature on post-hoc classifier calibration in ML is now sufficiently rich that it is no longer straightforward to obtain or maintain a good overview of the area.” This lack of a clear overview impedes research, requiring scientists to examine numerous papers to find relevant information. Indeed, valuable as the survey of Silva Filho et al. (2023) is, it covers only a small subset of the available metrics.
This article presents a comprehensive review of classifier probability calibration metrics. The aim of the review is to serve as a useful reference for researchers attempting to understand the landscape of such metrics. The contributions of the article are as follows:
- A wide-ranging survey of classifier probability calibration metrics for models with a discrete output is provided. This review addresses a gap in the literature for which there is high demand. Maier-Hein et al. (2024), in their major review of general classifier metrics, originally intended to omit the class of calibration metrics, but the decision was reversed due to high demand expressed through crowdsourced feedback. Nevertheless, that review covers only a small fraction of the calibration metrics described in the present article.
- Metrics are organized according to several novel categorizations to understand the relationships between them. The main categorization is the four families of point-based, bin-based, kernel or fitted-curve, and cumulative metrics. Each family has advantages and disadvantages, as do individual metrics within families. These are described in detail in the article.
- This article represents the most comprehensive survey of probability calibration metrics to date, providing descriptions of significantly more metrics than those in the overlapping lists provided by other authors. The five most similar reviews are by Flach and Song (2020; who describe seven metrics), Hagopian et al. (2023; 13 metrics), Silva Filho et al. (2023; nine metrics), Maier-Hein et al. (2024; eight metrics), and Tao et al. (2024a; 11 metrics). Between them, these five reviews discuss 26 metrics, whereas the present article analyzes 94 metrics.
- Where relevant, equations are provided with a unified notation to facilitate implementation and comparison by future researchers. Original papers that first describe metrics use a wide variety of notations, making comparisons difficult.
- Metrics that were previously treated in isolation are brought together and conceptual relationships are highlighted. Several authors have independently invented the same metric but given it a different name. This review treats such metrics as a single entry and consolidates the separate analyses provided by the original authors. Some metrics are special cases of others but have not previously been associated. These connections are noted along with integrated discussions. Other metrics that are nominally from different families, but have conceptual similarity, are discussed together, enabling cross-fertilization of ideas between different research groups.
- Finally, we recommend specific metrics for general, multiclass, and local calibration scenarios.
The remainder of this article is organized as follows. Section 2 introduces key concepts relating to calibration metrics and defines major notations used in this article. Sections 3–6 respectively describe metrics in the point-based, bin-based, kernel or fitted-curve, and cumulative families. A dendrogram showing the hierarchy of families and subfamilies is shown in Figure 1. Conclusions and recommendations are given in Section 7. Appendix A describes lesser-used metrics. Appendix B includes further discussion on skill scores, bootstrapping and consistency sampling, and consistency calibration. Appendix C contains a comprehensive list of symbols and their basic definition. Finally, Appendix D contains a table that summarizes the main pros and cons of each metric and other information, such as the range of attainable values and alternative names for the same metric.

Figure 1. Hierarchy of probability calibration metric families and subfamilies.
2. Calibration Metric Concepts
2.1. Notation
The literature on probability calibration metrics is inconsistent in terminology and notation. We describe techniques in a unified way, rather than using original symbolizations. We start by defining the problem of interest to be assigning a piece of data or “datapoint” to one of
For each datapoint, the model produces a predicted probability or confidence
A more comprehensive list of symbols used in this article and their basic definition is given in Table 3 in Appendix C. Each term is described in more detail where it first appears in the article.
2.2. Calibration Curve and Reliability Diagram
An ideally calibrated classifier outputs confidence scores or predicted probabilities equal to its accuracy, conditional on the score. The term “accuracy” is also referred to in various sources as “actual positive rate,” “fraction of positives,” “empirical probability,” or “observed relative frequency.” In practice, the accuracy for a particular confidence value can be higher or lower than that confidence. The actual accuracy as a function of confidence for a particular class is known as the calibration curve. An example theoretical nonperfect calibration curve is illustrated in Figure 2, along with the ideal calibration line. In this example, the classifier is overconfident in the target class—the achieved accuracy is lower than the model's confidence. In binary classification, the calibration curve for Class 1 contains all the information necessary to understand the model's calibration, as the curve for Class 0 is its complement. Multiclass models are more complex, as discussed in Section 2.3.

Figure 2. Theoretical calibration curve with two example datapoints having the same confidence, but different true labels. The red arrows indicate the calibration errors for those individual datapoints. The curve represents an overconfident classifier.
For a single datapoint, the classifier is either correct or incorrect, and the model can only be perfectly calibrated if the confidence is zero or unity. For other confidence values, there is inevitably some calibration error. Figure 2 demonstrates this for two example datapoints with the same confidence value of 0.7 for Class 1. The true label is zero for one datapoint and the error for that datapoint is 0.7, indicated by the lower red arrow. The true label for the other datapoint is unity, and the error for that datapoint is 0.3, indicated by the higher red arrow.
Although perfect calibration for a single datapoint is impossible without perfect accuracy, a classifier with nonperfect accuracy over a set of datapoints may still be well calibrated. For example, if 10 datapoints all have a confidence of 0.7 and seven out of those 10 are classified correctly, then this classifier is perfectly calibrated for the dataset. In practice, not all confidence values output by a model are expected to be identical, so confidence values are often grouped into nonoverlapping bins and the accuracy of datapoints in each bin is analyzed. The visual representation of this information is known as a reliability diagram, curve, or plot. This is frequently shown as a bar chart, see Guo et al. (2017) for example. However, line plots facilitate better visual inspection of the data because they show trends over the underlying continuous confidence variable. Some works place markers at the center of each bin on the horizontal confidence axis. However, if the data are not uniformly distributed throughout the bin this can give misleading results. Therefore, it is better to place markers at the mean confidence for each bin (Silva Filho et al., 2023). Vasilev and D'yakonov (2023) distinguish between the bar chart representation being called the reliability diagram and the line representation being called the reliability plot.
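To make the binning procedure concrete, the following minimal Python sketch (function and variable names are illustrative and not taken from any cited implementation) computes the per-bin counts, mean confidences, and empirical accuracies that a reliability diagram displays, placing markers at the mean confidence of each bin as recommended above:

```python
import numpy as np

def reliability_bins(conf, correct, n_bins=10):
    """Group confidences into equal-width bins and compute, for each bin,
    the number of datapoints, the mean confidence, and the empirical accuracy."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)  # 1 if the predicted class is correct, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each confidence to a bin index in [0, n_bins - 1]
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    mean_conf = np.full(n_bins, np.nan)
    accuracy = np.full(n_bins, np.nan)
    for b in range(n_bins):
        if counts[b] > 0:
            mean_conf[b] = conf[idx == b].mean()    # marker position on the confidence axis
            accuracy[b] = correct[idx == b].mean()  # empirical accuracy (marker/bar height)
    return counts, mean_conf, accuracy

# Example: 500 random confidences from a synthetic, overconfident model
rng = np.random.default_rng(0)
conf = rng.uniform(0, 1, 500)
correct = (rng.uniform(0, 1, 500) < 0.8 * conf).astype(float)  # accuracy below confidence
print(reliability_bins(conf, correct))
```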
An example reliability plot is shown in Figure 3 as a solid line, and additional information is included. The dataset used to compute this diagram was generated as 500 random samples, with confidence values and true labels determined based on the true calibration curve in Figure 2. In Figure 3, true labels are jittered by ±0.025 for improved visualization. The markers on the reliability curve show the mean achieved accuracy in 10 equal-width bins, and the error bars represent the standard error of those estimates. The standard error is greater for confidence values near 0.5 than zero or unity and is in general greater when there are fewer datapoints, although this effect is not apparent for this example dataset. The measured reliability curve is approximately the same shape as the true calibration curve. The same information is shown in Figure 4 using the slightly more common bar chart representation.

Figure 3. Reliability line plot for 500 labeled datapoints, with 10 bins. The plot shows that, for a finite dataset, empirical accuracy does not always increase with confidence.

Figure 4. Reliability bar chart for 500 labeled datapoints, with 10 bins. The chart emphasizes the equal width (EW) of the bins.
Most binary calibration metrics measure some aspect of the data visible in Figures 2, 3, or 4, whether this is the location of the datapoints, or the degree to which the calibration curve, as estimated via binning or other means, deviates from the identity line.
2.3. Multiclass Aspects
Most work on probability calibration metrics relates to binary classifiers. However, multiclass versions also exist. There are three ways to define calibration for multiple classes (Silva Filho et al., 2023): top-label (or confidence) calibration, in which only the confidence of the predicted class must match the accuracy of that prediction; class-wise calibration, in which the predicted probability for each class, considered marginally, must match the empirical frequency of that class; and full multiclass calibration, in which the entire predicted probability vector must match the conditional distribution of classes given that vector.
Vaicenavicius et al. (2019) give theoretical examples where a multiclass classifier is either top-label calibrated or class-wise calibrated, but not fully multiclass calibrated. Lack of full multiclass calibration could be important in safety critical applications, especially where the action taken should depend not only on the most likely outcome, but also the probabilities of other less-likely outcomes, which may have severe negative consequences.
Multiclass problems are often decomposed into binary subproblems whose outputs are aggregated to give an overall multiclass score (Kull & Flach, 2014). The most common decompositions are one-vs-rest (OVR) or comprehensive pairwise. The advantage of such decompositions is that binary classification theory and practice, including probability calibration, can be applied to subproblems without the complication of multiclass issues. Furthermore, some classifiers are inherently designed only to work with binary problems. Decomposition allows these classifiers to be applied in a multiclass setting without further modification. The disadvantage of OVR decompositions is that only class-wise calibration can be measured, not full multiclass calibration.
2.4. Other Observations on Calibration
Numerous studies have empirically shown that classifiers often exhibit overconfidence. The usual explanation given is that the models are large enough to memorize training data and maximize confidence. However, Bai et al. (2021) show that certain classifiers are inherently overconfident, even when the dataset is large and the number of model parameters is small. Specifically, this applies to logistic regression and other classifiers where the activation function is symmetric and concave in the positive half. Munir et al. (2022) state that rectified linear unit activation functions, widely used in modern deep neural nets, and similar piece-wise linear functions, are a core reason behind overconfident predictions in test data far away from the training data. Minderer et al. (2021) show that model structure is more important than model size in understanding probability calibration, and that recent models, especially those not using convolutions, are in fact among the best calibrated.
The metrics described in this article serve as absolute measures of model calibration. Recalibration techniques attempt to improve the correctness of model confidence values by post processing. The aim of such techniques is to invert the effect of the model calibration curve so that the overall process produces the identity function. In practice, these techniques are not perfect, but their “calibration gain” can be defined, which is the improvement in calibration error (by any metric) when the technique is applied (Zhang et al., 2020).
2.5. Scope of Article
A systematic and comprehensive review of classifier probability calibration metrics was conducted as follows. First, Google Scholar was used to generate a long list of the top 100 papers that included the words “probability,” “calibration,” and “metric,” starting from the year of the seminal paper by Guo et al. (2017) that spurred recent widespread interest in probability calibration. From this long list, papers that did not describe a new metric or compare existing metrics were removed, leaving a short list. Metrics that can be used as classifier probability calibration metrics, even if they were originally designed for another purpose, such as a loss function, were defined as in scope. Forward and backward citation searching, seeded from this short list, was performed to cover any gaps from the initial search. The result of this process was a list of papers that together comprehensively describe classifier probability calibration metrics. The metrics were then analyzed, grouped into families, and explained in this article. We include new insights into relationships and properties of these metrics.
Although this review focuses on classifiers, object detection models also produce probabilistic outputs and pose related calibration challenges. Whereas a classifier only considers the discrete label of a piece of data, an object detector also provides object size and location coordinates, often within an image. Although object detector confidence scores can be assessed based only on labels of associated ground-truth objects, the additional degrees of freedom allow a more nuanced form of assessment. Analysis of object detection is out of scope for this article. Example probability calibration metrics specific to object detection are discussed by Neumann et al. (2018), Küppers et al. (2020), Conde et al. (2023), Oksuz et al. (2023), Popordanoska et al. (2024), and Kuzucu et al. (2025).
This survey concentrates on the assessment of model probabilities rather than methods to improve calibration. Such methods are reviewed well by Flach and Song (2020) and Silva Filho et al. (2023). The present article focuses on reporting practical mathematical definitions of metrics, their properties, why they might be used, and a summary of existing performance analyses. Experiments that perform numerical comparisons of small subsets of metrics have been conducted elsewhere—see Widmann et al. (2019), Zhang et al. (2020), Gruber and Buettner (2022), Roelofs et al. (2022), Matsubara et al. (2023), Peng et al. (2023), or Kängsepp et al. (2025), for example. New experiments to provide a comprehensive comparison would be a significant undertaking requiring thousands of hours of compute time (Kängsepp et al., 2025) and are not included here for space reasons. A full description of the in-scope probability calibration metrics, arranged by family, now follows.
3. Point-Based Metrics
3.1. Introduction to Point-Based Metrics
Point-based metrics compute a score for each datapoint
3.2. Proper Scores
Proper scoring rules are point-based evaluation measures for probability estimates that avoid the need to put confidence scores into bins (Silva Filho et al., 2023). Following Bröcker (2009), a scoring rule
By convention, low scores indicate good predictions. A score is “proper” if the divergence
Strictly proper scores can be decomposed into three terms: reliability, resolution, and uncertainty, which facilitates interpretation of the score (Bröcker, 2009):
Reliability is a measure of the degree to which predictions differ from the actual sample relative frequencies (Murphy, 1973), with lower values being good. In early works, reliability was referred to as validity, but it is often now known as calibration loss (Silva Filho et al., 2023). As an illustrative example, if a classifier only produces confidence values of exactly 90% or 70%, and is respectively correct 90% or 70% of the time for those specific values, then the classifier is “reliable” and has a reliability value of zero.
Resolution quantifies how much the sample proportions for each unique predicted probability differ from the overall sample proportions for the whole dataset (Murphy, 1973), with higher values being better. For example, consider a binary dataset where 50% of the datapoints belong to each class. A classifier that always gives a confidence of 50% and chooses randomly between the two classes would have an accuracy of 50% and be well calibrated globally, but would not be particularly useful. This classifier has zero resolution. An alternative, reliable classifier that randomly assigns a confidence of 80% or 20% to Class 1, each half the time, has a better, nonzero resolution, despite also having an accuracy of only 50%. Resolution is also called sharpness (Bröcker, 2009).
Uncertainty is the score that would be achieved by replacing confidence values with the proportions of the actual samples (Murphy, 1973), with lower being better. Thus, this term is inherent to the data and does not relate to the classifier. For example, if 80% of the datapoints are from one class, a theoretical classifier that always gives a confidence value of 80% would achieve a score equal to the uncertainty value. If all datapoints are from one class, the uncertainty is zero. In a binary classifier, the uncertainty is highest when half of the datapoints belong to each class.
The three-term decomposition of scores is only useful when more than one datapoint has the same predicted probability vector. This may be the case for human forecasters that are prone to specifying probabilities on a discrete scale (10%, 20%, etc.). However, algorithms specify probabilities on a continuous scale from zero to unity and it is unlikely that many predicted probability vectors will have the same exact value. When all probabilities are different, the resolution and uncertainty terms cancel out, and the reliability is the same as the overall score. Thus, the decomposition has less utility in modern algorithmic contexts than traditional human prediction analysis.
An alternative decomposition considers proper scores as a sum of two components: epistemic loss, due to the model not being optimal, and an irreducible or aleatoric loss, which is the loss of the theoretically optimal model, due to randomness in the data (Silva Filho et al., 2023). This decomposition helps focus analysis on parts of the problem that are in control of model designers. Other decompositions are also available (Popordanoska, 2025).
Silva Filho et al. (2023) recommend that classifiers should be trained using a proper scoring rule as loss function rather than nonproper functions. This is because the resulting models are likely to produce better probabilities, since probability refinement and calibration would be encouraged during training.
3.3. Brier Score
The Brier score (BS) is one of the earliest and best-known calibration metrics. It is a common way of measuring how much the accuracy of a model diverges from its confidence, with lower scores meaning the model is well calibrated. The BS is also called mean square error (MSE) (Flach, 2019) or the quadratic score (Gneiting & Raftery, 2007). The BS can be calculated by equation (3).
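For illustration, a minimal Python sketch of the computation in the sum-over-classes form is given below. This form ranges from 0 to 2 for binary problems; some authors rescale it or use only the positive-class probability, so it may differ in scaling from equation (3).

```python
import numpy as np

def brier_score(prob, labels, n_classes):
    """Mean squared difference between predicted probability vectors and
    one-hot encoded true labels (lower is better)."""
    prob = np.asarray(prob, dtype=float)            # shape (N, n_classes)
    onehot = np.eye(n_classes)[np.asarray(labels)]  # shape (N, n_classes)
    return np.mean(np.sum((prob - onehot) ** 2, axis=1))

probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])
print(brier_score(probs, [0, 1, 0], n_classes=2))   # approximately 0.253
```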
The score can be thought of as the mean of the squares of the arrowed data-point calibration-error line distances in Figure 2. Initially developed for weather forecasting, the BS is now widely used as a general measure of risk prediction. Its advantages are that it is easily calculated and interpreted, and that it is a strictly proper score. The original definition of BS for binary classifiers is in the range
The square root Brier score (RBS) is a robust estimator and upper bound of the canonical calibration error. RBS is compared to bin, kernel, and cumulative metrics in Gruber and Buettner (2022). Of the metrics tested, only RBS and the cumulative Kolmogorov–Smirnov (KS) metric are consistent in value with respect to data size—a desirable property.
3.4. Logarithmic Metrics
Negative log likelihood (NLL), also called binary cross-entropy, ignorance (Bröcker, 2009), logarithmic score or predictive deviance (Gneiting & Raftery, 2007), logistic loss (for binary problems), or multinomial logistic loss (for multiclass problems) is calculated by the equation:
The NLL takes on any nonnegative value and is strictly proper (Tödter & Ahrens, 2012). When NLL is small, the model is well-calibrated. Like the BS, NLL is easy to compute but conflates accuracy and calibration (Gupta et al., 2021). One issue with NLL is that if the confidence is zero for the correct label for any datapoint, the NLL evaluates to infinity. This reflects the fact that a good prediction system should never assign zero probability to possible events (Tödter & Ahrens, 2012). If the confidence is a small value instead of zero, the NLL can still be very large. Thus, the metric severely penalizes highly unlikely predictions, which may be an indicator of lack of calibration. However, this means that single datapoints can have a large effect on the overall metric value, which is an undesirable property (Gneiting & Raftery, 2007). Considering that some datasets have "label noise" where the supposed ground truth labels are incorrect for a small proportion of datapoints, this property of the metric has potential to cause major issues.
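A minimal sketch of the computation follows; the clipping parameter eps is an assumption introduced here to show how zero probabilities are commonly handled in practice, not part of the metric's definition.

```python
import numpy as np

def negative_log_likelihood(prob, labels, eps=None):
    """Mean negative log probability assigned to the true class (lower is better).
    If eps is given, probabilities are clipped to [eps, 1] to avoid an infinite
    score when the true class receives zero probability."""
    prob = np.asarray(prob, dtype=float)                     # shape (N, n_classes)
    p_true = prob[np.arange(len(prob)), np.asarray(labels)]  # probability of each true label
    if eps is not None:
        p_true = np.clip(p_true, eps, 1.0)
    with np.errstate(divide="ignore"):
        return -np.mean(np.log(p_true))

probs = np.array([[0.9, 0.1], [0.0, 1.0], [0.6, 0.4]])
print(negative_log_likelihood(probs, [0, 0, 0]))             # inf: one true class has zero probability
print(negative_log_likelihood(probs, [0, 0, 0], eps=1e-12))  # large but finite
```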
Focal loss (FL) is a modified version of NLL designed to focus on hard-to-classify examples (Lin et al., 2017). FL is defined as:
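For illustration, the widely used form of the focal loss, the NLL with each term weighted by (1 − p)^γ where p is the probability assigned to the true class, can be sketched as follows; the default γ and the omission of the class-balancing factor α are choices made for this sketch.

```python
import numpy as np

def focal_loss(prob, labels, gamma=2.0):
    """Focal loss sketch: NLL with each term down-weighted by (1 - p_true)**gamma,
    so easy, high-confidence examples contribute less (alpha balancing omitted)."""
    prob = np.asarray(prob, dtype=float)
    p_true = prob[np.arange(len(prob)), np.asarray(labels)]
    return -np.mean((1.0 - p_true) ** gamma * np.log(p_true))

probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
print(focal_loss(probs, [0, 1, 0], gamma=2.0))  # reduces to NLL when gamma = 0
```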
When
Sumler et al. (2025) describe a general metric to measure the consistency of continuous or discrete probabilistic algorithms, including multitarget trackers, classifiers, multihypothesis trackers, and particle filters. They derive a simplified version of this metric when applied to binary classifiers and call it the entropic calibration difference (ECD). This is defined as:
ECD is a signed metric—it can take on any value on the real line. When positive, the classifier is overconfident. When negative, the classifier is underconfident. When zero, the classifier is perfectly calibrated. The signed nature of the metric is an advantage compared to other metrics that only give information on whether a classifier is calibrated without the direction of miscalibration. The absolute value of ECD is a proper scoring rule.
3.5. Global Metrics
Global metrics compute the mean confidence of all datapoints and compare this to the mean accuracy. The two numbers are expected to match for a calibrated system. This type of metric is included among point-based metrics due to its use of simple sums over the datapoints. However, global metrics can also be viewed as bin-based metrics with a single bin covering the entire dataset. Individual metrics vary in how they compare the global sums. Due to the use of aggregate statistics, global metrics are not proper.
The global squared bias (GSB) is defined by Galbraith and Van Norden (2011) as:
The GSB is a measure of the match between the unconditional mean predicted probability and the unconditional mean probability of the outcome. The metric, as defined in equation (7), takes on values between zero and unity. However, when comparing GSB with the old definition of BS, many authors omit the
Another metric like the GSB is the multiclass difference of confidence and accuracy (MDCA). It is defined by Hebbalaguppe et al. (2022) as:
MDCA was designed to be used as an additional loss term for minibatches of data in neural net training to encourage calibrated models. However, it can be used to assess calibration of the whole dataset. The metric is differentiable, which allows it to be used as part of gradient descent algorithms. The metric is equivalent to the ECE (see Section 4.2) with a single bin for all probabilities.
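As a sketch of such a single-bin, global comparison in the spirit of MDCA (an illustrative reading, not necessarily the exact formulation of Hebbalaguppe et al., 2022), the mean predicted probability of each class can be compared with its empirical frequency and the absolute differences averaged:

```python
import numpy as np

def global_class_difference(prob, labels):
    """Single-bin (global) calibration sketch: for each class, compare the mean
    predicted probability with the empirical class frequency, then average the
    absolute differences over classes."""
    prob = np.asarray(prob, dtype=float)                            # shape (N, K)
    n, k = prob.shape
    mean_conf = prob.mean(axis=0)                                   # average confidence per class
    class_freq = np.bincount(np.asarray(labels), minlength=k) / n   # empirical frequency per class
    return np.mean(np.abs(mean_conf - class_freq))

probs = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
print(global_class_difference(probs, [0, 0, 1]))
```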
Another global metric for binary problems is the ratio of the expected to observed (EO) number of datapoints in the target class:
The metric takes on nonnegative values on the real line. If
3.6. Normalized Square Metrics
The Dawid–Sebastiani score (DSS) is a metric for models that output general probabilistic predictions, using the first two statistical moments of the predictive distribution. DSS is defined by Moore et al. (2022) as:
In (10),
The normalized squared error score (NSES) is defined by Moore et al. (2022) as:
NSES takes on any nonnegative value. The metric is like the DSS but omits the logarithmic term. However, unlike DSS, NSES is not proper. Despite its impropriety, NSES is considered by Moore et al. (2022) to have valuable diagnostic properties, as it can be used to distinguish overconfidence
3.7. P-Norm Metrics and Variants
The pointwise
When
De Leeuw et al. (2009) define a modified version of
This is similar to a pseudo Huber loss (Barron, 2019); see Appendix A.8 for the standard Huber loss. For small errors L1eps acts like the
4. Bin-Based Metrics
4.1. Introduction to Bin-Based Metrics
Point-based metrics such as the BS have been used to measure calibration for several decades. However, one issue with these metrics is that it is usually impossible for a classifier to achieve a perfect score. This is because if any confidence other than pure certainty is predicted for a particular datapoint, there will be a difference between the confidence and the label for the correct class, which is zero or unity by definition. Thus, the only way for a classifier to be assessed as perfectly calibrated is if it only ever assigns a confidence of 100% to the correct class and is always correct. This is impractical for any real dataset.
Bin-based metrics group datapoints with similar confidence values. The proportion of correct classification in each bin can then be compared to the mean confidence of the bin, or some other single value representative of all the datapoints in the bin. This allows imperfect, but well calibrated, classifiers to achieve a good calibration score. An illustration of binned data is shown in Figure 4. Conceptually, bin-based metrics are based on the difference between the binned accuracy and perfect calibration lines.
The remainder of this section describes several bin-based metrics and their advantages and disadvantages. One general disadvantage is that a small change in confidence value of one datapoint can cause it to be assigned to an adjacent bin, resulting in a discrete jump in metric value, which may be undesirable. The following notation specific to binned metrics is used. The number of bins is
4.2. Expected Calibration Error
ECE is a widely used metric for quantifying the miscalibration of probabilistic classifiers, where lower values indicate better calibration (Guo et al., 2017). Some authors call ECE “empirical” or “estimated” instead of “expected,” as it is not a true expectation (Silva Filho et al., 2023). The binary-ECE is calculated via the formula:
The standard form of ECE uses equal width (EW) bins so it is sometimes named ECE-EW (Roelofs et al., 2022). ECE is also called the mean absolute calibration error (Lee et al., 2023). ECE is easy to compute and visualize—it is the absolute area between estimated and perfect calibration bars—see Figure 4.
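A minimal sketch of the standard top-label, equal-width-bin computation is given below; the bin count and the synthetic example data are arbitrary choices.

```python
import numpy as np

def ece_equal_width(conf, correct, n_bins=10):
    """Top-label expected calibration error with equal-width bins: the
    weighted average of |accuracy - mean confidence| over the bins."""
    conf = np.asarray(conf, dtype=float)        # top-label confidences
    correct = np.asarray(correct, dtype=float)  # 1 if prediction correct, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece, n = 0.0, len(conf)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += (mask.sum() / n) * gap       # bins weighted by their share of datapoints
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = (rng.uniform(0, 1, 1000) < conf - 0.1).astype(float)  # slightly overconfident model
print(ece_equal_width(conf, correct))
```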
Despite widespread use, ECE has several disadvantages. First, it is trivially possible to obtain a perfect ECE by making predictions that simply reproduce the label distribution, regardless of the input (Liu et al., 2020). For example, in binary classification, if 60% of data examples are from Class 1, then assigning a confidence of 60% to all items, regardless of input features, will produce a perfectly calibrated classifier according to ECE, but one with poor accuracy.
Second, some fixed bins may have very few or no datapoints. Thus, the calibration error for those bins either cannot be computed or has a very high variance, which is reflected in the overall ECE. Datapoint sparsity for some bins is demonstrated by Guo et al. (2017), who provide an example histogram of the number of samples in each bin for an image classifier. Low confidence bins have very few samples, and the lowest two bins have no samples at all.
Third, in multiclass problems, standard ECE is computed for the highest probability class only. In this case, it is known as top-label classification error (Kumar et al., 2019) or confidence-ECE (Silva Filho et al., 2023). However, it may be important to distinguish between second, third, or lower ranking classes when the assessment from the classifier under test is being combined with other information (Widmann et al., 2019).
Fourth, ECE conflates calibration and sharpness when a model is highly accurate (Nixon et al., 2019). Sharpness is the desire for models to predict with high confidence. This conflation issue is related to the two previous ones. Since only top-label confidence is analyzed, confidence values are naturally concentrated toward unity. As ECE places a low weight on sparsely populated bins with low confidence, the metric relies primarily on high confidence bins. For these high confidence values, an accurate classifier has tall bin heights that are near the line of perfect calibration in the reliability diagram. Thus, the accurate classifier naturally has a lower ECE than if confidence values were more uniformly spread.
Fifth, ECE depends on the scale of the probabilities. If many probabilities are small (e.g., 0.001), the ECE will be small even if the achieved accuracy, while also small, differs from the confidence by a large factor (e.g., an accuracy of 0.01, a factor of 10 difference; Matsubara et al., 2023).
Sixth, ECE is a highly discontinuous function of classifier confidence values due to its fixed width bins. This makes it a difficult metric to use in gradient-based optimization schemes (Kumar et al., 2018), and a small change to a single confidence value could have a large effect on the overall ECE.
Seventh, for a model with a certain fixed calibration performance, the value of ECE decreases as the number of datapoints used to compute it increases (Gruber & Buettner, 2022). This complicates comparisons across datasets of different sizes.
Due to these shortcomings, several variations of ECE have been proposed. These are described in subsequent sections. However, we first describe some general definitions for binary and multiclass classifiers that aid discussion of such metrics.
4.3. Mean Calibration Error
The
This general definition incorporates several well-known specific metrics. Kumar et al. (2019) state that the most common value is
4.4. Binned Multiclass Calibration Error
For confidence bin
If
For full multiclass-ECE, probability vectors could in theory be binned in simplex space. The difference between the mean probability vector and the vector of class proportions in each bin could then be computed. However, most bins would likely be empty or have very few datapoints (Silva Filho et al., 2023). This can be seen as follows. If each dimension (class) is subdivided into
4.5. Variants of ECE
Several minor variants to ECE have been proposed. Rather than using equal-width confidence bins, adaptive calibration error (ACE) by Nixon et al. (2019) uses bins based on fixed percentiles of confidence scores in the test dataset, so that each bin has the same number of datapoints. ACE is computed class-wise over all classes for multiclass problems, like the static calibration error. ACE is also called equal-mass (EM) ECE, or ECE-EM (Roelofs et al., 2022). Estimators with bins of equal mass have lower bias than estimators with bins of EW (Roelofs et al., 2022). EW and EM methods of binning are also referred to as width binning and frequency or quantile binning, respectively (Silva Filho et al., 2023). A disadvantage of ECE-EM is that some parts of the confidence space may have wide bins, preventing the ability to model variation in accuracy in those regions. The equal-area ECE (ECE-EA), or “equiareal ECE,” has bins with approximately equal area, providing a middle ground between ECE-EM and ECE-EW (Röchner et al., 2024). Thresholded ACE (TACE) uses frequency binning but only includes datapoints with a confidence value above a certain threshold. The logic behind this is that in situations with many classes, many class probabilities are low, and this washes out the calibration score. A threshold of 0.01 is used by Nixon et al. (2019). When only the top
Calibration can also be assessed by the area between the reliability plot and the ideal line, as seen in Figure 3. This is known as the area between curves (ABC), or the integrated calibration error, in Hagopian et al. (2023). The total absolute area is the standard ECE when the curve is constructed as a binned estimate. It is also possible to provide separate reports of the area above the curve, representing underconfidence in some regions, and below the curve, representing overconfidence in other regions (Hagopian et al., 2023). Reporting above and below areas is a useful decomposition of ECE (which is the sum of the two areas) to better understand under and overconfidence. However, the need to analyze two variables makes above/below ABC harder to use when assessing multiple recalibration algorithms automatically.
MCE is like ECE but instead of measuring an average of the calibration errors, MCE measures the largest calibration error. This is useful when it is important for a model to be extremely well calibrated across a range of confidence values. MCE is calculated by:
MCE can lead to unintuitive results when there is wide variance in calibration between histogram bins, which is more likely to happen when the test set is small. In these situations, the metric is highly sensitive to the placement of bins (Silva Filho et al., 2023). MCE may be most suitable for safety-critical applications, where it is important to understand the worst-case calibration at any confidence level. MCE is also called the
Region-balanced ECE (RBECE) is defined by Dawkins and Nejadgholi (2022) as:
In (18),
The monotonic sweep calibration error (ECE-SWEEP) chooses the largest number of bins for which the bin heights, as computed by standard ECE, are monotonic. When tested on data, it is found that the optimal bin count grows with sample size (Roelofs et al., 2022). An efficient implementation of the sweep method for choosing the number of bins is the bin count search (BCS) method—see Section 4.12 for its application as part of a debiased metric. ECE-SWEEP has a lower bias than several other metrics, including standard ECE—see Section 4.13. As with many other metrics, the disadvantage of ECE-SWEEP is that its standard definition is for binary rather than multiclass classification. However, measurement of class-wise calibration can be achieved through an OVR strategy.
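To illustrate the equal-mass binning and monotonic-sweep ideas together, the following sketch (an approximation of ECE-SWEEP, not the reference implementation of Roelofs et al., 2022; the maximum bin count is an arbitrary choice) increases the number of equal-mass bins until the per-bin accuracies stop being monotonic and reports the ECE at the largest monotonic bin count:

```python
import numpy as np

def binned_stats(conf, correct, n_bins):
    """Equal-mass bins: per-bin counts, mean confidences, and accuracies."""
    edges = np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges[1:-1], conf, side="right"), 0, n_bins - 1)
    stats = []
    for b in range(n_bins):
        m = idx == b
        if m.any():
            stats.append((m.sum(), conf[m].mean(), correct[m].mean()))
    return stats

def ece_sweep(conf, correct, max_bins=50):
    """Monotonic-sweep ECE sketch: use the largest number of equal-mass bins
    for which the per-bin accuracies are still monotonically non-decreasing."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    best = 1
    for n_bins in range(2, max_bins + 1):
        acc = [a for _, _, a in binned_stats(conf, correct, n_bins)]
        if all(a2 >= a1 for a1, a2 in zip(acc, acc[1:])):
            best = n_bins
        else:
            break
    n = len(conf)
    return sum(cnt / n * abs(a - c) for cnt, c, a in binned_stats(conf, correct, best))

rng = np.random.default_rng(2)
conf = rng.beta(5, 1, 2000)                                       # confidences concentrated near 1
correct = (rng.uniform(0, 1, 2000) < conf ** 1.5).astype(float)   # overconfident model
print(ece_sweep(conf, correct))
```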
4.6. Partially Binned Metrics
Distance from calibration error (DCE) is introduced as a theoretical measure by Błasiok et al. (2023). Angelopoulos et al. (2025) provide a bin-based empirical estimate of DCE with bins of EW for binary classifiers. Let the confidence
This metric operates on labels as points but confidences as bins—a partial binning strategy. The DCE estimate in equation (19) is shown by Angelopoulos et al. (2025) to be a tight upper bound on the true DCE for a sensible choice of
The Sanders-modified BS (SMBS), or Murphy's BS, is a partially binned version of the BS (Hagopian et al., 2023). Like DCE, SMBS uses binning to estimate aggregate confidence values in a bin while keeping the labels as individual points. A subtle difference between DCE and SMBS is that DCE uses max aggregation, but SMBS uses mean aggregation to obtain
For real datasets, the difference between SMBS and BS is small (Hagopian et al., 2023). Apart from the difference in aggregation strategy, SMBS is the
Label-binned calibration error (ECE-LB) uses binning to estimate true proportions of labels
ECE-LB is at least as large as standard ECE. It has the advantage over many other binned metrics that it considers the variation of confidence values in each bin rather than relying only on their mean. ECE-LB is named probability deviation error in Torabian and Urner (2024), where it is shown to have a lower bias than ECE.
4.7. Fit-on-the-Test ECE
The Fit-on-the-test (FOTT) calibration error is a metric subfamily that compares a calibration map
Kängsepp et al. (2025) define a binning scheme where, instead of assuming constant probabilities based on the mean within a bin, bin probabilities may be non-continuous piecewise linear functions of the input confidence. This binning scheme defines

Figure 5. Tilted-roof reliability diagram for 500 labeled datapoints, with 10 bins. The diagram facilitates comparison of bar heights with the perfect calibration line.
The number of bins used with ECE can be selected through cross-validation by optimizing the ECE-FOTT loss. The optimum number of bins for a calibration task containing 5,000 datapoints was 14. It is interesting that this is in the range of 10–20 bins that are usually arbitrarily used in standard ECE calculations. This suggests the number of bins typically used is sensible. However, according to Popordanoska (2025), there is no optimal default number of bins, since every scenario has its own bias-variance tradeoff.
4.8. Multipartition Metrics
The interval calibration error (ICE) is a theoretical metric that averages the ECE over all possible bin widths and locations. In practice, it cannot be computed directly, as the number of possibilities is huge for large datasets. However, a surrogate ICE (SICE) metric can be computed through Monte Carlo sampling in two stages as outlined by Błasiok et al. (2023). The first stage computes a metric called the random ICE (RICE), which is based on a modification of the bin-based ECE. For a single Monte Carlo run of RICE, all bin widths, except two, are set according to
A maximum number of bins to test is chosen via
The random aspect of SICE averages out the effect of discontinuous jumps in the calibration map at bin edges, making it a consistent estimator, although the inherent randomness makes it undesirable for repeatable assurance applications. In tests, SICE is better than standard ECE but not as good as the Laplace kernel calibration error (LKCE; see Section 5.5) or the smooth calibration error (SCE; see Section 5.7).
Another metric that aggregates over all possible intervals is the cutoff calibration error (CCE; Rossellini et al., 2025). This uses the same definition as the MCE in equation (17), but with generalized bin widths and locations. Unlike ECE, CCE is “testable,” defined as being able to test the hypothesis that the true theoretical metric for a calibrated system is less than a specified threshold, based on an estimate using finite data. Other advantages of CCE are that it has no adjustable parameters and is a continuous function of confidence values, unlike most bin metrics. A disadvantage is the need to perform search over intervals, which complicates implementation.
4.9. Signed Calibration Error Metrics
Signed calibration error metrics measure both under and overconfidence, a beneficial property not common among metrics. The expected signed calibration error (ESCE; Verhaeghe et al., 2023), or miscalibration score (MCS), is defined by Ao et al. (2023) as:
The standard definition of this metric uses EW bins. To reduce the well-known high local variance of binned estimators, Verhaeghe et al. (2023) use the mean of this metric over uniformly distributed bin sizes in the range 0.005 to 0.05. This averaging process is also used by the authors for computing standard ECE, a process with some similarities to SICE. The range of ESCE is
4.10. Soft Bin Metrics
Soft-binning ECE (SBECE) by Karandikar et al. (2021) uses soft binning to obtain a metric that is differentiable, allowing it to be used as a loss function to encourage a calibrated model while training using gradient descent. The first step is to define bin membership for confidence values. If
In (26),
The soft-binned size, confidence, and accuracy of bin
In a similar manner to the mean calibration error in equation (15), the SBECE is then computed as:
Differentiable ECE (DECE) is like SBECE but uses a different bin membership function (Bohdal et al., 2023). If the upper edges of bin
Like SBECE, DECE is relatively insensitive to temperature parameter values
Another soft-bin metric, the fuzzy calibration error, is discussed in Appendix A.18. An alternative differentiable ECE metric uses the LogSumExp function to soften the choice of class with the maximum confidence value (Wang et al., 2023).
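The following generic sketch illustrates the soft-binning idea; the membership function here (a softmax over negative squared distances to bin centers, with an arbitrary temperature) is an illustrative choice and differs from the exact membership functions used by SBECE and DECE.

```python
import numpy as np

def soft_binned_ece(conf, correct, n_bins=10, temperature=0.01):
    """Generic soft-binning calibration error: each datapoint contributes to every
    bin with a weight given by a softmax over negative squared distances to the bin
    centers, so the result varies smoothly with the confidence values."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # membership weights: shape (N, n_bins), each row sums to one
    logits = -((conf[:, None] - centers[None, :]) ** 2) / temperature
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    bin_mass = w.sum(axis=0)                                  # soft count per bin
    bin_conf = (w * conf[:, None]).sum(axis=0) / bin_mass     # soft mean confidence
    bin_acc = (w * correct[:, None]).sum(axis=0) / bin_mass   # soft accuracy
    return np.sum(bin_mass / len(conf) * np.abs(bin_acc - bin_conf))

rng = np.random.default_rng(3)
conf = rng.uniform(0, 1, 1000)
correct = (rng.uniform(0, 1, 1000) < 0.8 * conf).astype(float)
print(soft_binned_ece(conf, correct))
```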
4.11. Overlapping-Bin Metrics
One of the criticisms of bin-based metrics is the discontinuity at bin boundaries. The CalBin metric by Bella et al. (2013) addresses this issue by using overlapping bins of equal mass. The first bin is defined as the first
The value for
Another overlapping bin metric is the k-nearest neighbors (KNN) ECE, or ECE-KNN, by Peng et al. (2023). This is defined in the same way as standard ECE, except that there is one bin per datapoint, and the other points in each bin are the k-nearest neighbors to the point in question, in terms of confidence value. A partially manual process for selecting k is given. ECE-KNN has a lower bias than ECE-EW, ECE-SWEEP, and ECE-DEBIAS when assessing uncalibrated models.
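A sketch of the idea follows; for simplicity it approximates the k nearest neighbors in confidence by a contiguous window over the sorted confidences, and the value of k is arbitrary rather than selected by the published procedure.

```python
import numpy as np

def ece_knn(conf, correct, k=100):
    """KNN-style ECE sketch: for every datapoint, form a 'bin' from its k nearest
    neighbors in confidence (including itself), compare the mean confidence and
    accuracy of that neighborhood, and average the absolute gaps over datapoints."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(conf)
    conf_s, correct_s = conf[order], correct[order]
    n = len(conf_s)
    errors = np.empty(n)
    for i in range(n):
        lo = max(0, min(i - k // 2, n - k))   # contiguous window of k sorted neighbors
        window = slice(lo, lo + k)
        errors[i] = abs(correct_s[window].mean() - conf_s[window].mean())
    return errors.mean()

rng = np.random.default_rng(4)
conf = rng.uniform(0, 1, 2000)
correct = (rng.uniform(0, 1, 2000) < 0.8 * conf).astype(float)
print(ece_knn(conf, correct, k=100))
```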
4.12. Debiased Metrics
Kumar et al. (2019) note that the so-called “plugin estimator” (PE) in equation (15) of the
The square root of CE2-DB is called the debiased RMSCE (DRMSCE) by Petersen et al. (2023). Like the class-wise extension of ECE, the computational complexity of DRMSCE is
For the
In equation (34),
In the limit of infinite size datasets, both CE2-DB and ECE-DB take on values between zero and unity. However, for finite datasets, both metrics can have small negative values, due to the debiasing term. Bootstrap methods can be used for hypothesis testing. Using the biased plugin calibration error estimator to test for calibration leads to rejecting well-calibrated models too often (Kumar et al., 2019). That is, there are too many false alarms when attempting to detect miscalibrated models (Widmann et al., 2019). Therefore, the DB estimator should be used for more refined hypothesis testing. Kumar et al. (2019) use the equal mass binning strategy.
Petersen et al. (2023) define an adaptive BCS method to decide the number of bins to use with any binned metric. Like ECE-SWEEP, the method chooses the largest number of bins for which mean confidence values in bins, as computed by the base metric, are monotonic. BCS is more efficient than SWEEP as it uses interval search. The method is described in Appendix A.9, the pure version of which has no adjustable parameters. Petersen et al. (2023) compare the performance of ECE-EW, ECE-EM, and DRMSCE, each using BCS or a fixed
Xiong et al. (2023) show that confidence scores for data in sparse areas of the feature space tend to be overconfident, and scores in dense areas tend to be underconfident. If the scores for dense and sparse data lie in the same confidence bin, then these can cancel out, making a model seem more calibrated according to ECE than it really is. To mitigate this proximity bias or cancelation effect, the proximity-informed ECE (PIECE) metric is proposed. The metric bins datapoints by both confidence value and “proximity” value, where proximity is based on the mean distance of a datapoint to its 10 nearest neighbors according to their feature values. PIECE is defined as:
The bins are equal mass, with the number of confidence bins
Yang et al. (2024) introduce the partitioned calibration error (PCE). This operates in a similar way to PIECE but is more general as it allows any grouping of datapoints based on confidence or feature values and includes the possibility of averaging over different partitions (ways of grouping) of the same dataset. If there are
PCE takes on many other metrics as special cases. For example, if there is one partition of the dataset into
Pan et al. (2020) introduce the field-level ECE (FECE), also called Field-ECE, which measures calibration bias in sensitive input fields of interest to the decision-maker (e.g., protected characteristics of a person). The mathematical definition of FECE is the same as PIECE in equation (35) but with the
4.13. Bias Analysis
A framework known as bias by construction is used by Roelofs et al. (2022) to analyze the bias of various metrics. In this framework, the probability estimates produced by classifiers for real data are used to fit a parametric model for the density of confidence values and calibration curves. The fitted models are then used as ground truth to generate large amounts of synthetic data. The framework has been used to compare ECE, ECE-DEBIAS, ECE-SWEEP and kernel density estimation (KDE) metrics. KDE is described in Section 5.6. The analysis also compares equal mass and EW versions of the ECE metrics. In all cases, equal mass versions of ECE show a lower level of bias than the EW versions as the number of samples increases. The ranking of bias, for realistic calibration curves, from least to most biased is: ECE-SWEEP, ECE-DEBIAS, ECE, KDE. Poor performance for KDE could be due to use of standard parameters of the metric, which may only have been optimized for the simple synthetic datasets used in Zhang et al. (2020). For perfectly calibrated classifiers, ECE-DEBIAS is better than ECE-SWEEP. Therefore, there is little to choose between ECE-DEBIAS and ECE-SWEEP. For either method, at least 500 samples are required to reliably detect a classifier with a 10% calibration error, and 10,000 samples are required to detect one with a 2% error.
A general problem with bin-based schemes is that the true calibration is unmeasurable with a finite number of bins. Whether EW or EM bins are used, standard binned metrics such as ECE increase in value as the number of bins increases, for a fixed number of datapoints (Arrieta-Ibarra et al., 2022; Kumar et al., 2019). In the limit of infinite data, binned metrics underestimate the true calibration error, but for finite datasets ECE can under or overestimate the calibration error (Roelofs et al., 2022).
4.14. Hypothesis-Test Bin-Based Metrics
Hypothesis tests can be developed to assess the significance of differences of a measured calibration metric from zero, under the assumption that a model is well calibrated. These tests are usually based on a test statistic, which can on its own be a metric for calibration. Hypothesis tests can be developed for most metrics through bootstrapping schemes (see Appendix B.2 for details). Caution should be noted when interpreting
The Hosmer–Lemeshow (HL) statistic has been a popular calibration measure of binary classifiers, especially in the medical field (Huang et al., 2020). It was originally used with logistic regression classifiers (Lee et al., 2023) but has wider applicability. The “c-statistic” version is equivalent to an equal mass binning scheme with
This can be rewritten as:
The HL statistic takes on any nonnegative value and follows a chi-squared distribution with
Test for calibration (T-Cal; Lee et al., 2023) is a hypothesis test based on a debiased PE (DPE) of
DPE is similar in form to CE2-DB in equation (33) but with the debias term computed in a different way. In the limit of infinite size datasets, DPE takes on values between zero and unity. However, for finite datasets, the metric can take on small negative values, due to the debiasing term. For multiclass problems, the binning scheme used for DPE is an equal-volume partition of the simplex. However, other binning schemes could be used. To select the optimum number of bins, the basic T-cal test requires knowledge of the smoothness of the calibration map, which is not generally known in practice. Therefore, an adaptive scheme performs hypothesis tests for a range of numbers of bins and rejects the null hypothesis of calibration if any test is rejected. In tests on synthetic data with known properties, T-cal is shown to outperform Cox's method (see Section 5.11) and
Sun et al. (2024) show that the distribution of
5. Kernel-Based and Fitted-Curve Metrics
5.1. Introduction to Kernels and Fitted-Curve Metrics
Among the disadvantages of bin-based metrics are the arbitrary grouping of datapoints into bins, and discrete jumps between bins. Kernel or curve-based metrics fit a smooth calibration curve to the data, and the calibration error is based on the area between the fitted calibration curve and the perfect line (illustrated in Figure 2). With kernel or curve-based metrics, small changes in confidence values of individual datapoints result in small changes in metric values. Metrics in this family employ models of varying complexity to fit the data.
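As a generic illustration of this family (not any specific published estimator), the following sketch estimates the calibration curve by Gaussian-kernel (Nadaraya-Watson) regression of correctness on confidence, then averages the deviation from the identity line weighted by the estimated density of confidences; the bandwidth and grid size are arbitrary choices.

```python
import numpy as np

def kernel_calibration_error(conf, correct, bandwidth=0.05, grid_size=200):
    """Generic kernel-based calibration error sketch: fit a smooth calibration
    curve via kernel regression, then average |curve(c) - c| weighted by the
    kernel-density estimate of the confidence distribution."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)
    # Gaussian kernel weights between every grid point and every datapoint
    w = np.exp(-0.5 * ((grid[:, None] - conf[None, :]) / bandwidth) ** 2)
    density = w.sum(axis=1)                      # unnormalized density of confidences on the grid
    curve = (w * correct[None, :]).sum(axis=1) / np.maximum(density, 1e-12)
    return np.sum(np.abs(curve - grid) * density) / density.sum()

rng = np.random.default_rng(5)
conf = rng.uniform(0, 1, 2000)
correct = (rng.uniform(0, 1, 2000) < 0.8 * conf).astype(float)
print(kernel_calibration_error(conf, correct))
```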
5.2. Kernel Basics
Kernel-based metrics fit a smooth calibration curve based on a locally weighted sum of datapoints. The weighting is defined by the kernel function
The general difference-kernel estimation of the theoretical smooth calibration curve
The parameter
Adaptive widths can improve on fixed-width estimates, which perform poorly in sparse parts of the distribution (Ristic et al., 2008). Adaptation makes the widths larger in regions with fewer datapoints. Detecting miscalibration is only possible with a finite dataset when the conditional probabilities of the classes are sufficiently smooth functions of the predicted confidences, such as kernel-based ones (Lee et al., 2023). Smoothness is implied in many calibration measurement schemes but is not usually directly addressed. Unless otherwise stated, kernel-based metrics take on values between zero and unity and have computational complexity
5.3. Binary Classifier Kernel Metrics
The mean squared calibration error (MSCE) is defined by Galbraith and Van Norden (2011) as:
The Gaussian kernel function is used to compute
Smooth ECE (SECE) is defined by Wang et al. (2023) as:
The Gaussian kernel function is used to compute
Smooth ECE (SMECE) is defined by Błasiok and Nakkiran (2024) in a similar manner to SECE. However, it performs smoothing on the residual
The kernel bandwidth parameter
Popordanoska et al. (2022) define a general kernel density estimator (KDE) and call it ECE-KDE. For the binary classification problem, a partially debiased beta KDE (BKDE) specialization is defined as:
The beta kernel is defined as:
In equation (45),
5.4. Dirichlet Kernel Density Estimator
Popordanoska et al. (2022) introduce the DKDE to measure strong calibration error in multiclass problems as a specialization of ECE-KDE. DKDE is known as ECEKDE in Gruber and Buettner (2022). DKDE is computed using the p-norm of a vector based on the label and confidence vectors, and is defined as:
The Dirichlet kernel is:
In equation (47),
The advantages of BKDE and DKDE are that they are consistent estimators (unlike ECE or maximum mean calibration error [MMCE]), scalable with respect to number of classes (unlike ECE or Mix-n-Match), debiased (unlike Mix-n-Match or MMCE), and differentiable (unlike ECE). Computation of BKDE and DKDE takes
5.5. Pairwise Comparison Metrics
Pairwise comparison metrics compare pairs of individual point calibration errors, using a kernel to weight contribution of each summation term. The metrics vary according to the kernel used, which terms to include in the summation, and whether the metric assesses class-wise or strong multiclass calibration.
The MMCE is a kernel-based error introduced by Kumar et al. (2018). The motivation behind this metric is to use it as a supplementary target during classifier training. The claim is that other train-time calibration methods based on entropy penalties or temperature smoothing usefully reduce aggregate calibration error but undesirably suppress legitimately confident individual predictions. MMCE can be computed from equation (48).
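A sketch of a pairwise-kernel calibration error in the spirit of equation (48) follows; the Laplacian kernel width used here is an assumption rather than the published setting.

```python
import numpy as np

def mmce_sketch(conf, correct, width=0.4):
    """Pairwise-kernel calibration error sketch in the spirit of MMCE: pairwise
    products of per-datapoint residuals (correctness minus confidence), weighted
    by a Laplacian kernel over confidence values. The width is an assumed value."""
    conf = np.asarray(conf, dtype=float)        # top-label confidences
    correct = np.asarray(correct, dtype=float)  # 1 if prediction correct, else 0
    resid = correct - conf
    kern = np.exp(-np.abs(conf[:, None] - conf[None, :]) / width)
    m = len(conf)
    # the Laplacian kernel is positive semidefinite, so the quadratic form is nonnegative
    return np.sqrt(np.maximum((resid[:, None] * resid[None, :] * kern).sum() / m ** 2, 0.0))

rng = np.random.default_rng(6)
conf = rng.uniform(0.5, 1.0, 1000)
correct = (rng.uniform(0, 1, 1000) < conf - 0.1).astype(float)
print(mmce_sketch(conf, correct))
```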
Choice of kernel function is arbitrary, but the implementation selected by Kumar et al. (2018) is the Laplacian kernel with a width of
A more complicated weighted version of MMCE equalizes the effect of correct and incorrect examples for multiclass problems, which result in imbalanced datasets when decomposed into binary problems. This is found to improve calibration results relative to using the unweighted version in equation (48). Although equation (48) is quadratic in the number of datapoints
Widmann et al. (2019) discuss more general kernel models and suggest that MMCE can only be used for binary classification problems. However, as identified by Kumar et al. (2018), the construction of the metric allows it to be applied when assessing the top-label (highest confidence) performance of multiclass classifiers. Furthermore, it can be used to measure class-wise calibration using a one-vs-all strategy.
The LKCE can be computed using the general kernel formula in equation (48), using the Laplace Kernel with fixed width
The squared kernel calibration error (SKCE) is computed in a similar manner to MMCE (Widmann et al., 2019). However, whereas MMCE applies only to binary classifiers or top-label confidence, SKCE quantifies strong multiclass calibration and hence is more generally applicable. Several versions of SKCE have been defined, all based on a pairwise error term. This term is defined as:
Equation (50) has a similar form to the MMCE summand in equation (48). However, in equation (50) the
The biased (B) SKCE-B is defined as:
This is the multiclass extension of MMCE. However, the metric is biased and takes
The unbiased quadratic (UQ) SKCE-UQ is defined as:
This metric is unbiased but still takes
The unbiased linear (UL) SKCE-UL is defined as:
This metric is unbiased and only takes
There are two issues with SKCE-UL. The first is that the value of the metric relies on the order in which the datapoints are presented to the algorithm—in equation (53) only adjacent pairs of datapoints contribute to the sum. This means that if the same inputs are shuffled, the computed metric may be different. This is an undesirable property. The second issue is that the metric effectively assumes the datapoints are randomly distributed with respect to their characteristics. However, certain data processing pipelines may sort datapoints by confidence value or true label. If that is the case, higher weightings will be encountered than expected on average, potentially resulting in overly high metric values.
The SKCE-UL and SKCE-UQ metrics theoretically lie in the range [0, 1]. However, for nearly perfectly calibrated models with true SKCE≈0, the metric can be slightly negative for certain arrangements of datapoints. This is part of the normal variance associated with computing an unbiased metric. Nevertheless, this property may harm interpretability.
A comparison is made between standard ECE and the three SKCE metrics using 10,000 synthetic datasets, each containing 250 samples with known ground truth, for both calibrated and uncalibrated models. The ECE exhibits both negative and positive bias, whereas SKCE-B is theoretically guaranteed to be biased upward. Hypothesis testing using ECE and consistency resampling (see Section B.2) is found to be unreliable, and this worsens with more classes. The asymptotic approximations for the two unbiased SKCE metrics are good for moderate numbers of classes. However, for 100 classes, SKCE-UL exhibits some mild multimodality in its distribution of values over the datasets, and for 1,000 classes it is strongly bimodal—see Figure 27 in Appendix J.2.3 of Widmann et al. (2019). SKCE-UQ appears to have good properties for all tests performed, up to 1,000 classes.
In conclusion for SKCE, compared to ECE and SKCE-B, SKCE-UL may be preferred as it is unbiased, quick to compute, supports fast hypothesis tests, and has reasonable performance. However, the dependency of SKCE-UL on the order of datapoints is a major disadvantage, and it performs poorly for problems with more than 100 classes. The SKCE-UQ is preferable to SKCE-UL, as it is more stable. However, this comes at the expense of potentially high computation time for very large datasets, and the Monte Carlo nature of its hypothesis test, which may be undesirable for assurance applications.
Vashistha and Farahi (2025) define the I-trustworthy framework and the associated kernel local calibration error (KLCE) metric, which considers variations of calibration in different parts of the feature space. The metric is defined in the same way as SKCE-UQ but
5.6. Smoothed Kernel Density Estimator
Zhang et al. (2020) define a smoothed KDE (SKDE)-based ECE estimator. This metric is also referred to as Mix-n-Match by Popordanoska et al. (2022), due to its use with the Mix-n-Match recalibration method. The metric performs kernel smoothing for both the confidence estimates and the true labels. The KDE estimate of the confidences is:
Computation of the full SKDE does not scale well with the number of classes due to the curse of dimensionality, so it is recommended to be used with the top-label or class-wise decomposition of multiclass classifiers into binary classifiers (Zhang et al., 2020). In the binary case, SKDE can be computed based on a grid approximation to an integral, with
The triweight kernel with a fixed width based on the normal rule of thumb is used, since that kernel has been recommended for problems with limited support (Zhang et al., 2020). Analysis of the metric's construction shows that its computational complexity is
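A minimal sketch of a grid-based KDE-ECE estimate for the binary or top-label case is given below, using a triweight kernel as described above; the bandwidth, grid size, and function name are illustrative assumptions rather than the exact choices of Zhang et al. (2020).

```python
import numpy as np

def kde_ece_binary(confidences, correct, bandwidth=0.1, grid_size=512):
    """Sketch of a KDE-based ECE estimate on a grid (binary / top-label case)."""
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)

    def triweight(u):
        u = np.clip(np.abs(u), 0.0, 1.0)
        return (35.0 / 32.0) * (1.0 - u**2) ** 3

    # Kernel weight of every datapoint at every grid location.
    w = triweight((grid[:, None] - c[None, :]) / bandwidth)          # (grid, n)
    density = w.sum(axis=1)                                          # ~ density of confidences
    # Kernel-smoothed accuracy at each grid point (guard against empty regions).
    smoothed_acc = np.where(density > 0,
                            (w * y).sum(axis=1) / np.maximum(density, 1e-12),
                            grid)
    # ECE as a density-weighted average of |confidence - smoothed accuracy|.
    weights = density / density.sum()
    return float(np.sum(np.abs(grid - smoothed_acc) * weights))
```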
5.7. Smooth Calibration Error
The SCE aims to provide a metric that formally varies smoothly with respect to changes in its inputs (Błasiok et al., 2023). The empirical SCE can be computed as:
The constraints in the definition of SCE produce an implicit weighting function
5.8. Integrated Calibration Index and Error Percentile
The integrated calibration index (ICI) is a fitted-curve-based metric, and as such, it has some similarities with kernel-based metrics. It is similar to Cox's intercept and slope method (see Section 5.11) since it fits a calibration curve and then analyses the curve (Huang et al., 2020).
The development of the ICI was motivated by Harrell's Emax index, which is the MAE between a smooth calibration curve and the diagonal line of perfect calibration. The smoothed curve is obtained via the locally estimated scatterplot smoothing (LOESS) algorithm. This is a nonparametric regression algorithm, also called the Savitzky-Golay filter when the independent variables are equally spaced, that fits a low-degree polynomial to datapoints near the point of interest. Austin and Steyerberg (2019) utilize a second-degree polynomial, 75% of the full dataset contributing to each estimate, and a tri-cubic weighting to down-weight datapoints far from the estimation point.
The ICI is the weighted difference between observed and predicted probabilities, in which observations are weighted by the empirical density function of the predicted probabilities. From a theoretical perspective, ICI is given by equation (58), where
The empirical ICI is:
A metric related to ICI is the error percentile (EP) metric, also called EX (Hagopian et al., 2023). This is the
ICI, EP, and Emax have advantages over other ways of measuring calibration: they have a simple interpretation and assign a greater weight to dense data areas, which suppresses poor estimates from sparse areas. The three metrics were compared by Austin and Steyerberg (2019) using simulated data with known ground truth while examining the performance of correctly and incorrectly specified models. ICI tends to demonstrate more consistent behavior in tests than EP or Emax.
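A minimal sketch of the empirical ICI is shown below, using the LOESS implementation in statsmodels; note that this implementation fits degree-one local polynomials, so it only approximates the degree-two fit described above, and the frac parameter mirrors the 75% choice of Austin and Steyerberg (2019).

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def ici(predicted_probs, outcomes, frac=0.75):
    """Sketch of the empirical ICI: the mean absolute difference between a
    LOESS-smoothed calibration curve and the predicted probabilities."""
    p = np.asarray(predicted_probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    # lowess returns pairs (sorted predicted prob, smoothed observed outcome).
    smoothed = lowess(y, p, frac=frac, return_sorted=True)
    p_sorted, curve = smoothed[:, 0], smoothed[:, 1]
    abs_diff = np.abs(curve - p_sorted)
    # ICI is the mean; np.max(abs_diff) gives Emax and a percentile gives EP.
    return float(np.mean(abs_diff))
```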
5.9. Estimated Calibration Index
The estimated calibration index (ECI; Van Hoorde et al., 2015), also called the expected calibration index in Maier-Hein et al. (2024), is similar in concept to ICI. The definition is:
In equation (60), the calibration map
5.10. Fit on the Test
Kängsepp et al. (2025) define a general fit-on-the-test (FOTT) paradigm, where parameters of a calibration function from a family of functions are fit to the data by minimizing the ECE-FOTT loss in equation (22) through cross-validation. Section 4.5 describes how bin-based schemes are a particular family of functions, leading to the tilted-roof reliability diagram. Curve-based function families have also been assessed using this paradigm, as described below.
The piecewise linear (PL) method for evaluating calibration fits a PL calibration map
The PL in logit-logit space (PL3) method fits a continuous PL function to logit functions
Other families tested under the FOTT paradigm include ECE-EM, Platt scaling, beta scaling, isotonic regression, spline fitting, and intraorder preserving functions (Kängsepp et al., 2025).
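As a simple illustration of the FOTT idea, the sketch below uses isotonic regression (one of the tested families) fitted directly on the evaluation data and reports the mean absolute gap between the fitted calibration map and the identity; the cross-validated model selection of the full paradigm is omitted, and the function name is an assumption.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fott_isotonic_error(confidences, correct):
    """Sketch of a fit-on-the-test calibration error with an isotonic family."""
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    fitted = iso.fit_transform(c, y)            # estimated calibration map at each c_i
    return float(np.mean(np.abs(fitted - c)))   # ECE-style gap to the diagonal
```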
FOTT metrics have been assessed both with synthetic data, where the true calibration map is known, and with real data, where the true calibration map can be estimated accurately when orders of magnitude more data are available than are used to compute the calibration metric under test. The CIFAR-5m dataset contains 5 million synthetic images created so that models trained on it perform similarly to models trained on CIFAR-10, and vice versa. Metrics are assessed based on three objectives: (a) the quality of the reliability diagrams; (b) the quality of the calibration error estimates; and (c) Spearman's rank correlation between the metric ranks and the (approximately) true calibration error ranks, when assessing recalibration methods. For objective (a), PL is the best metric on average, followed by PL3 and beta scaling. However, the rank of beta scaling worsens as the number of datapoints increases due to the small number of parameters in the model. For objective (b), the 15-bin ECE-EM or beta metrics are the best, depending on the task. PL and PL3 also perform well. For objective (c), isotonic regression is best, followed by 15-bin ECE-EM. PL is better than average and outperforms PL3 (Kängsepp et al., 2025).
The above assessment does not provide a clear ranking of calibration metrics as the ranking varies by objective. However, PL generally ranks well, and 15-bin ECE-EM also surprisingly ranks well, especially for the important task of ranking recalibration methods. Since ECE-EM is now considered a “classical” method and PL is relatively easy to compute, these should be considered for use generally. In passing, it is noted that the extensive experiments by Kängsepp et al. (2025) required over 10,000 hours of computer time to complete. Thus, it would be costly to recreate similar bespoke experiments for new projects.
5.11. Cox Intercept and Slope
The Cox intercept and slope (CIS) method measures calibration by performing a regression of the observed outcome against the log odds of the predicted confidence (Huang et al., 2020):
Perfect calibration has intercept
The CIS method has some similarities with the PL3 method, with the fitted function for CIS being equivalent to the fit for PL3 with a single segment. However, ECE-PL3 computes an ECE-style measure of the area between the fitted curve and the line of perfect calibration, whereas CIS analysis looks at the values of the intercept and slope to understand calibration. Although CIS provides more information than ECE-PL3, the need to analyze two variables makes CIS harder to use when assessing multiple recalibration algorithms automatically, like the above/below ABC metrics (Section 4.5).
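A minimal sketch of the CIS computation for a binary problem is given below, fitting a logistic regression of the observed outcome on the log odds of the predicted probability; the clipping constant used to keep the logit finite is an illustrative assumption.

```python
import numpy as np
import statsmodels.api as sm

def cox_intercept_slope(predicted_probs, outcomes, eps=1e-6):
    """Sketch of the Cox intercept-and-slope calibration check.

    Perfect calibration corresponds to an intercept near 0 and a slope near 1.
    """
    p = np.clip(np.asarray(predicted_probs, dtype=float), eps, 1 - eps)
    y = np.asarray(outcomes, dtype=float)
    logit_p = np.log(p / (1 - p))                  # log odds of the predictions
    X = sm.add_constant(logit_p)
    fit = sm.Logit(y, X).fit(disp=0)
    intercept, slope = fit.params
    return float(intercept), float(slope)
```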
5.12. Hypothesis-Test Kernel or Curve-Based Metrics
The statistical beta calibration test (SBCT) fits a beta calibration curve
The
The parabolic Wald statistic (PWS) is defined by Galbraith and Van Norden (2011). A hypothesis test based on this statistic fits a parabola
The PWS metric takes nonnegative values and is asymptotically distributed according to the chi-squared distribution, so standard significance tests can be constructed from it. The statistic assumes that the calibration curve is well approximated by a parabola. This may be the case for some datasets but does not hold in general if the calibration curve has a more complex character.
Gweon (2023) describes a Pearson chi-squared reliability statistic based on k-nearest-neighbors in the confidence prediction space and a Bayesian approach for estimating the expected power of the reliability test for different sample sizes. The use of nearest neighbors makes this metric reminiscent of kernel methods and ECE-KNN.
6. Cumulative Metrics
6.1. Empirical Cumulative Calibration Error
Reliability diagrams are usually based on binned estimates or sometimes kernel estimates, but the selection of the width parameter for either type of representation can be arbitrary. Cumulative metrics sort datapoints by confidence and examine the difference between the cumulative accuracy and the perfect calibration line. The advantage of such metrics is that there is no need to set or estimate arbitrary parameters, such as bin width or curve parameters. Avoiding arbitrary choices helps when using metrics for regulatory compliance. The cumulative difference plot (CDP) is defined as
For a perfectly calibrated classifier, the CDP approximates a horizontal flat line at zero.
Two types of empirical cumulative calibration error (ECCE) are defined by Arrieta-Ibarra et al. (2022). The first is the maximum absolute deviation (MAD) of the CDP from zero. This is known as ECCE-MAD or the KS statistic.
The second error type is the range of the CDP. This is known as ECCE-R or the Kuiper statistic.
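The two statistics can be computed from a single pass over the sorted data, as in the sketch below; the 1/n scaling of the cumulative differences is an assumed normalization convention.

```python
import numpy as np

def ecce_statistics(confidences, correct):
    """Sketch of the ECCE statistics from the cumulative difference plot."""
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    order = np.argsort(c)
    cumdiff = np.cumsum(y[order] - c[order]) / len(c)   # cumulative difference plot
    ecce_mad = np.max(np.abs(cumdiff))                  # KS-style statistic
    ecce_r = np.max(cumdiff) - np.min(cumdiff)          # Kuiper-style statistic
    return ecce_mad, ecce_r
```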
The variance of
The statistic
In Arrieta-Ibarra et al. (2022), the ECCE metrics are assessed and compared with standard ECE using synthetic data, where ground truth statistics can be computed. It is shown that the ECCE metrics can distinguish calibrated and miscalibrated classifiers as the number of samples grows large, but this is not the case for ECE. Analysis of classifiers applied to real datasets with many datapoints produces very low
Gupta et al. (2021) define the KS calibration metric and describe it as a “binning-free calibration measure.” The final form of this metric is identical to ECCE-MAD, but it is derived in a slightly different way. A generalization of the metric also allows the assessment of whether the
The KS error is zero for perfect calibration and unity for completely uncalibrated systems. The metric can be construed as a percentage difference between two distributions, which aids interpretation of the metric. KS error can be shown to be a special case of kernel-based measures. The relationship between KS error and the cumulative distribution function (CDF) is the same as the relationship between MCE and the binned probability density function—they are both based on the maximum difference (Roelofs et al., 2022).
KS error is compared to RBS, ECE, and CWCE by Gruber and Buettner (2022). Only KS and RBS are consistent in value with respect to data size. KS error is used, along with ECE, KDE-ECE, MCE, and BS by Gupta et al. (2021) to assess various recalibration methods on an ImageNet recognition challenge. All metrics give similar rankings for the best recalibration methods. However, since the ground truth for this challenge is not known, it is not possible to determine which metric is best.
6.2. Cumulative Multi-Calibration Metric
Guy et al. (2025) discuss the concept of multi-calibration, which is to make sure that all specified subpopulations of the full dataset are calibrated. The
Equation (68) is a weighted version of equation (64), but only for a single subpopulation. A Kuiper statistic for the subpopulation is then computed as:
The multi-calibration metric (MCM) is defined as the worst-case Kuiper statistic weighted by its signal-to-noise ratio under the hypothesis of perfect calibration, assessed over all specified subpopulations:
The general MCM metric can be used with any specification of subpopulation. One specific use case could be to analyze the treatment of people according to protected characteristics to ensure that any automated decision systems are well calibrated for all such characteristics. However, Guy et al. (2025) describe an optional automated method for generating subpopulations based on using a binary tree to partition the data using feature values. The method retains all internal nodes of the tree, resulting in overlapping subpopulations.
The range of MCM is
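The overall structure of the metric can be sketched as follows; the square-root-of-size weighting is a stand-in for the signal-to-noise weighting of Guy et al. (2025), whose exact form is not reproduced here, and the subpopulation masks are assumed to be supplied by the user or by the tree-based partitioning method.

```python
import numpy as np

def mcm_sketch(confidences, correct, subpop_masks):
    """Structural sketch of an MCM-style worst-case subpopulation statistic."""
    c_all = np.asarray(confidences, dtype=float)
    y_all = np.asarray(correct, dtype=float)
    worst = 0.0
    for mask in subpop_masks:
        c, y = c_all[mask], y_all[mask]
        if len(c) == 0:
            continue
        order = np.argsort(c)
        cumdiff = np.cumsum(y[order] - c[order]) / len(c)
        kuiper = np.max(cumdiff) - np.min(cumdiff)       # subpopulation Kuiper statistic
        worst = max(worst, np.sqrt(len(c)) * kuiper)     # placeholder weighting
    return worst
```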
6.3. Brownian Bridge Test
Sadatsafavi and Petkau (2024) specify a hypothesis test that makes use of the equivalence of CDPs with BM. The test is generally more powerful than that based on ECCE-MAD. The test is based on a Brownian bridge, as follows. Let
The maximum absolute value of the bridged random walk is:
The Brownian bridge test (BBT)
In equation (74)
BBT was compared to the
7. Conclusion
7.1. Summary
This article analyzes the wide range of metrics used to assess the calibration of probabilities produced by ML models and organizes these metrics according to families: point, bin, kernel, curve, or cumulative metrics. For each family, Table 1 shows the number of metrics identified, how many have different nominal ranges of value, how many are proper, how many have a built-in associated hypothesis test, and the number that distinguish underconfidence from overconfidence. A list of general pros and cons for each family is shown in Table 2.
Table 1. Summary of Calibration Metric Families.
Table 2. General Pros and Cons for Each Calibration Metric Family. ECE = expected calibration error.
Table 4 in Appendix D summarizes all the individual metrics discussed in this article. For each metric, the table lists: alternative names; family; value range; whether the metric is proper; whether a hypothesis test is available; whether they distinguish underconfidence from overconfidence; and pros and cons.
7.2. Analysis and Recommendations
With so many metrics to choose from, it is necessary to recognize good metric properties to aid their selection. Hagopian et al. (2023) define various characteristics that good calibration metrics should have, including reproducibility, representativeness, interpretability, parsimony, and computational efficiency. Three further potentially useful properties are cataloged in Table 4. Being proper or having a hypothesis test are advantages of a metric but are not necessary for its use. The third property of being able to distinguish under from overconfidence is a major advantage, but lack of this characteristic does not exclude the metric, depending on the use case. Further properties only applicable to certain metrics appear in the pros and cons columns of Table 4.
No single metric is better than all others, as each one has advantages under certain conditions. Summaries of various numerical comparisons are included in the individual metric descriptions above. However, each comparison only covers a small subset of all metrics described so it can be difficult to assess the complete portfolio of possible metrics. Nevertheless, recommendations can be made for a series of use cases.
The first use case is to have a metric that is generally applicable for classifier probability calibration and is used to assess the class-wise calibration performance of models or calibration improvement of recalibration algorithms. To avoid conflating the assessment of accuracy and calibration a pure calibration metric should be selected. Such a metric should at least have Hagopian's properties. One candidate is the standard equal-width
Binned metrics can be “gamed” by models that set all confidence values to a single value or to a vector equal to the class proportions, producing outputs that score well on the metric but are not necessarily useful. Several authors, for example, Chidambaram and Ge (2025) and Silva Filho et al. (2023), therefore recommend that a proper score should be reported alongside any binned metric. The top two choices would be the BS or NLL due to their widespread use and understood properties. As discussed in Section 3.4, log metrics are highly sensitive to the presence of confidence values near zero, violating the representativeness criterion. Therefore, we recommend that the BS should be reported along with DRMSCE-BCS.
The second use case is full multiclass calibration. DRMSCE-BCS cannot be used for this purpose—no bin-based metrics have a reasonable extension to full multiclass calibration with
The third use case is determining whether miscalibrated classifiers are under or overconfident. Several metrics meet this high-level requirement. Of the point-based metrics, EO is the only one not to have sensitivity to specific confidence values like ECD, NSES, or the Spiegelhalter z statistic. However, as a global metric EO lacks refinement. From the bin-based metrics, ESCE is preferred as it has a simple definition and there is no clear benefit to the more complex DICI and WSMCS. The variant of ESCE by Verhaeghe et al. (2023) that averages over different bin widths should be used to reduce bias. Consideration should be given to an EM extension to further reduce bias. If it is required to determine under and overconfidence separately, then the bin-based ABC should be used. The kernel-based SCE has a slightly complex definition and thus lacks interpretability. The curve-based CIS has a simple definition, but requires the analysis of two parameters, which complicates analysis. Based on the above analysis, we recommend ESCE as a single metric to measure under and overconfidence.
The fourth use case is to measure the stronger notion of local calibration, where feature values are used in the assessment. Several metrics meet this high-level requirement: the bin-based PIECE, PCE, multi-view calibration error (MVCE), FECE, and VECE; the kernel-based KLCE and ECMMD; and the cumulative MCM. The metrics ECMMD and MVCE contain a random sampling element, harming their reproducibility. FECE and VECE use only one confidence bin, resulting in a global confidence metric for each feature-partition of the data, and an ensuing lack of refinement in confidence analysis. KLCE is the only metric that allows assessment of full multiclass calibration, and it has a defined hypothesis test based on bootstrapping. KLCE is unbiased and is smoother than the remaining bin metrics of PIECE and PCE. The major disadvantage of KLCE is its sensitivity to the choice of kernel and how its hyperparameters are optimized. PIECE bins data in feature space by density. While this allows miscalibration detection in the presence of cancellation between different parts of the feature space, the location of miscalibrated areas cannot easily be identified. Based on the analysis above, we recommend a specialization of the general PCE metric that uses
The fifth and final use case is for developers and researchers to perform deeper analysis beyond the use of a single numerical metric for automated comparison and assurance. Models may be under or overconfident at different confidence levels. These variations in calibration can cancel each other out to different extents depending on the metric, resulting in ostensibly good metric values that mask a more subtle issue (Van Calster et al., 2024). Therefore, for refined analyses, it is recommended that calibration curves with associated confidence intervals be plotted on the reliability diagram, as in Figure 3, and examined.
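A minimal sketch of such a plot is given below, using equal-width bins and bootstrap confidence intervals for the per-bin accuracy; the bin count, bootstrap settings, and plotting details are illustrative choices rather than a prescription from the cited works.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_plot(confidences, correct, n_bins=15, n_boot=200, seed=0):
    """Sketch of a reliability plot with bootstrap confidence intervals."""
    rng = np.random.default_rng(seed)
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    acc, lo, hi = [], [], []
    for a, b in zip(edges[:-1], edges[1:]):
        in_bin = (c >= a) & (c <= b) if b == 1.0 else (c >= a) & (c < b)
        yb = y[in_bin]
        if len(yb) == 0:
            acc.append(np.nan); lo.append(np.nan); hi.append(np.nan)
            continue
        acc.append(yb.mean())
        boots = [rng.choice(yb, size=len(yb), replace=True).mean() for _ in range(n_boot)]
        lo.append(np.percentile(boots, 2.5))
        hi.append(np.percentile(boots, 97.5))
    acc, lo, hi = np.array(acc), np.array(lo), np.array(hi)
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.errorbar(centers, acc, yerr=[acc - lo, hi - acc], fmt="o",
                 capsize=3, label="binned accuracy (95% CI)")
    plt.xlabel("confidence"); plt.ylabel("observed accuracy")
    plt.legend(); plt.show()
```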
7.3. Further Work
This article has outlined the definition and properties of many metrics. Existing comparisons between metrics typically only assess a small number of metrics at a time, and these comparisons are often carried out by metric designers who analyze from an angle that may favor their own metric. A large-scale independent study should compare the more promising metrics in this survey using data and models with known levels of miscalibration.
Certain specific lines of research would be useful. DRMSCE-BCS is a recommended metric for top-label and class-wise calibration but is not applicable to full multiclass calibration. A debiasing term and automated binning strategy should be developed for the multiclass setting, including an analysis of scalability as the number of classes grows. Sun et al. (2024) show that the distribution of
This survey focuses on the probability calibration of classifiers that produce confidence scores for a fixed set of unordered classes. Similar surveys should be carried out to catalog probability calibration metrics for other problems, including object detection, conformal classification with variable numbers of output classes, ordinal classification for ranked classes, and regression.
Finally, several open problems in probability calibration assessment are discussed in Silva Filho et al. (2023). These include characterizing the epistemic uncertainty in models and assessing out-of-distribution inputs. These topics should be investigated further from a calibration point of view.
Acknowledgments
The author thanks Emma Tattershall, Sarah Pengelly, and William Wood for productive discussions about probability calibration issues and for reviews of early versions of this survey.
Ethical Considerations
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for initial literature survey and write-up activities was provided by the Defence Science Technology Laboratory (DSTL) under Task 267 in the Analysis for Science and Technology Research in Defence contract DSTL/AGR/01142/01. Subsequent survey activities and the production of this article were funded by the QinetiQ Fellow scheme.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability
Since this article is a review article, no datasets are analyzed or generated other than for illustrative purposes.
