Confidence Calibration under Ambiguous Ground Truth

arXiv cs.LG / March 25, 2026


Key Points

  • Confidence calibration can break down when multiple annotators genuinely disagree, because conventional post-hoc calibrators are typically trained against majority-voted single-label targets.
  • The authors identify a structural bias in Temperature Scaling under ambiguous ground truth, where the learned temperatures can underestimate annotator uncertainty and miscalibration grows with annotation entropy.
  • They propose ambiguity-aware, post-hoc calibration methods that optimize scoring rules over the full annotator label distribution without requiring model retraining.
  • Dirichlet-Soft, which uses the full annotator distribution, delivers the best overall calibration quality; Monte Carlo Temperature Scaling (MCTS) with a single annotation per example can match full-distribution calibration; and LS-TS can improve calibration from voted labels alone via data-driven pseudo-soft targets.
  • Experiments on four benchmarks with real multi-annotator or clinically-informed synthetic annotations show large ECE reductions versus standard Temperature Scaling, with Dirichlet-Soft achieving 55–87% lower true-label ECE and LS-TS achieving 9–77% lower ECE without any annotator data.
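To make the baseline concrete, here is a minimal sketch (not the authors' code) of standard post-hoc Temperature Scaling fitted on hard, majority-voted labels, together with the top-label ECE metric the key points refer to. All function names and the grid-search fitting strategy are illustrative assumptions; in practice the temperature is usually fitted by gradient-based NLL minimisation on a held-out set.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stabilised)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Fit T by minimising NLL of hard (e.g. majority-voted) labels
    over a coarse grid -- the setup the paper argues is biased under
    ambiguous ground truth."""
    nlls = []
    for T in grid:
        p = softmax(logits, T)
        nlls.append(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())
    return grid[int(np.argmin(nlls))]

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error of the top-label confidence:
    bin predictions by confidence, average |accuracy - confidence|."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    acc = (pred == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return err
```

The paper's point is that evaluating `ece` against voted labels, as above, can look fine while calibration against the underlying annotator distribution remains poor.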

Abstract

Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model's own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.
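The contrast between full-distribution calibration and the single-annotation variant can be illustrated with a short sketch. The key observation behind MCTS (S=1) is that the NLL of one label sampled per example from the annotator distribution is, in expectation, the soft cross-entropy against that distribution, so the two objectives share the same minimiser. This is an illustrative toy under assumed names and a grid search, not the authors' implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stabilised)."""
    z = (logits - logits.max(axis=1, keepdims=True)) / T
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

GRID = np.linspace(0.5, 5.0, 91)  # candidate temperatures (illustrative)

def fit_T_soft(logits, soft_targets, grid=GRID):
    """Fit T by minimising cross-entropy against the full annotator
    label distributions (the strongest annotation requirement)."""
    ce = [-(soft_targets * np.log(softmax(logits, T) + 1e-12)).sum(axis=1).mean()
          for T in grid]
    return grid[int(np.argmin(ce))]

def fit_T_single_annotation(logits, soft_targets, rng, grid=GRID):
    """MCTS(S=1)-style: draw ONE annotation per example from the
    annotator distribution; the NLL on these samples is an unbiased
    estimate of the soft cross-entropy above, so no pre-aggregated
    distribution is ever needed."""
    n, k = soft_targets.shape
    y = np.array([rng.choice(k, p=soft_targets[i]) for i in range(n)])
    nll = [-np.log(softmax(logits, T)[np.arange(n), y] + 1e-12).mean()
           for T in grid]
    return grid[int(np.argmin(nll))]
```

With enough examples, the two fitted temperatures agree closely, which mirrors the abstract's claim that a single annotation per example can match full-distribution calibration.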