A Perfectly Truthful Calibration Measure

arXiv stat.ML / 5/6/2026



Key Points

The paper studies calibration measures for probabilistic predictors and introduces a new measure, averaged two-bin calibration error (ATB), specifically designed to be perfectly and strictly truthful in the batch setting.
It addresses a key limitation of existing calibration measures: when calibration is evaluated on finite random samples, predictors may be incentivized to “lie” to appear better calibrated.
ATB is shown to be quadratically related to established measures (smCal and distCal) and is computationally simple, enabling efficient calibration testing.
The authors provide the first linear-time calibration testing algorithm in this context, improving on prior work by Hu et al. (2024).
They also propose a general construction recipe for truthful calibration measures using variance additivity, and demonstrate extensions such as quantile-binned ℓ2-ECE.
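The finite-sample incentive to "lie" described in the key points can be seen in a small simulation. The sketch below is an illustration only, not the paper's ATB or any construction from it: it uses a standard equal-width binned ℓ1-ECE, and the function names (`binned_ece`, `simulate`), the choice of true probabilities (0.3 and 0.7), and the sample sizes are all assumptions made for the example. It compares a predictor that reports the true probabilities with one that always reports 0.5.

```python
import numpy as np


def binned_ece(preds, labels, n_bins=10):
    """Equal-width binned l1-ECE (illustrative; not the paper's ATB)."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to an equal-width bin; p = 1.0 goes in the last bin.
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    sum_y = np.bincount(bins, weights=labels, minlength=n_bins)
    sum_p = np.bincount(bins, weights=preds, minlength=n_bins)
    # (n_b / n) * |mean(y) - mean(p)| per bin equals |sum(y) - sum(p)| / n.
    return np.abs(sum_y - sum_p).sum() / len(preds)


def simulate(n=100, trials=2000, seed=0):
    """Average ECE of a truthful vs. a constant predictor (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Assumed ground truth: half the points have p = 0.3, half have p = 0.7.
    true_p = np.where(np.arange(n) % 2 == 0, 0.3, 0.7)
    constant = np.full(n, 0.5)  # the "lying" predictor ignores the truth
    truthful_scores, constant_scores = [], []
    for _ in range(trials):
        labels = rng.binomial(1, true_p)  # one finite random sample
        truthful_scores.append(binned_ece(true_p, labels))
        constant_scores.append(binned_ece(constant, labels))
    return float(np.mean(truthful_scores)), float(np.mean(constant_scores))


truthful_mean, constant_mean = simulate()
print(f"truthful predictor: {truthful_mean:.4f}")
print(f"constant predictor: {constant_mean:.4f}")
```

Under these assumptions the constant predictor receives the lower average score, because pooling all points into one bin averages away more sampling noise than splitting them across two bins. Both predictors are perfectly calibrated in the population, so the gap comes purely from evaluating on a finite sample.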

Abstract

Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. A calibration measure quantifies how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Predicting the true probabilities guarantees perfect calibration, but in reality, when calibration is evaluated on a random sample, all known calibration measures incentivize predictors to lie in order to appear more calibrated. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned ℓ2-ECE.
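The abstract does not spell out the recipe, but the elementary fact it refers to can be written down. As a sketch in assumed notation (not the paper's): let the labels y_i be independent Bernoulli(p_i) draws and let q_i be the reported predictions, fixed before the labels are seen. Variance additivity then splits the expected squared total error into a noise term and a bias term:

```latex
\mathbb{E}\Big[\Big(\sum_{i=1}^{n} (y_i - q_i)\Big)^{2}\Big]
  = \underbrace{\sum_{i=1}^{n} p_i (1 - p_i)}_{\text{variance, does not depend on } q}
  + \Big(\sum_{i=1}^{n} (p_i - q_i)\Big)^{2}
```

The first term is unaffected by what the predictor reports, and the second vanishes at q_i = p_i, so reporting the truth minimizes this particular squared quantity in expectation. An absolute-value error has no such decomposition, which is one way to understand why ℓ1-style binned measures can reward misreporting on finite samples. How the paper builds ATB and quantile-binned ℓ2-ECE from this idea is not stated in the abstract.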