Confidence Calibration under Ambiguous Ground Truth

arXiv cs.LG / March 25, 2026


Key Points

  • Confidence calibration can break down when multiple annotators genuinely disagree, because conventional post-hoc calibrators are typically trained against majority-voted single-label targets.
  • The authors identify a structural bias in Temperature Scaling under ambiguous ground truth, where the learned temperatures can underestimate annotator uncertainty and miscalibration grows with annotation entropy.
  • They propose ambiguity-aware, post-hoc calibration methods that optimize scoring rules over the full annotator label distribution without requiring model retraining.
  • Dirichlet-Soft, which uses the full annotator distribution, delivers the best overall calibration quality; Monte Carlo Temperature Scaling (MCTS) with a single annotation per example can match full-distribution calibration; and LS-TS can improve calibration from voted labels alone via data-driven pseudo-soft targets.
  • Experiments on four benchmarks with real multi-annotator or clinically-informed synthetic annotations show large ECE reductions versus standard Temperature Scaling, with Dirichlet-Soft achieving 55–87% lower true-label ECE and LS-TS achieving 9–77% lower ECE without any annotator data.
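To make the baseline concrete, here is a minimal sketch (not the authors' code) of standard post-hoc Temperature Scaling fitted on hard, majority-voted labels, together with the top-label ECE metric the key points refer to. All function names and the grid-search fitting strategy are illustrative assumptions; in practice the temperature is usually fitted by gradient-based NLL minimisation on a held-out set.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stabilised)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Fit T by minimising NLL of hard (e.g. majority-voted) labels
    over a coarse grid -- the setup the paper argues is biased under
    ambiguous ground truth."""
    nlls = []
    for T in grid:
        p = softmax(logits, T)
        nlls.append(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())
    return grid[int(np.argmin(nlls))]

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error of the top-label confidence:
    bin predictions by confidence, average |accuracy - confidence|."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    acc = (pred == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return err
```

The paper's point is that evaluating `ece` against voted labels, as above, can look fine while calibration against the underlying annotator distribution remains poor.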

Abstract

Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model's own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC 2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.
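The contrast between full-distribution calibration and the single-annotation variant can be illustrated with a short sketch. The key observation behind MCTS (S=1) is that the NLL of one label sampled per example from the annotator distribution is, in expectation, the soft cross-entropy against that distribution, so the two objectives share the same minimiser. This is an illustrative toy under assumed names and a grid search, not the authors' implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stabilised)."""
    z = (logits - logits.max(axis=1, keepdims=True)) / T
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

GRID = np.linspace(0.5, 5.0, 91)  # candidate temperatures (illustrative)

def fit_T_soft(logits, soft_targets, grid=GRID):
    """Fit T by minimising cross-entropy against the full annotator
    label distributions (the strongest annotation requirement)."""
    ce = [-(soft_targets * np.log(softmax(logits, T) + 1e-12)).sum(axis=1).mean()
          for T in grid]
    return grid[int(np.argmin(ce))]

def fit_T_single_annotation(logits, soft_targets, rng, grid=GRID):
    """MCTS(S=1)-style: draw ONE annotation per example from the
    annotator distribution; the NLL on these samples is an unbiased
    estimate of the soft cross-entropy above, so no pre-aggregated
    distribution is ever needed."""
    n, k = soft_targets.shape
    y = np.array([rng.choice(k, p=soft_targets[i]) for i in range(n)])
    nll = [-np.log(softmax(logits, T)[np.arange(n), y] + 1e-12).mean()
           for T in grid]
    return grid[int(np.argmin(nll))]
```

With enough examples, the two fitted temperatures agree closely, which mirrors the abstract's claim that a single annotation per example can match full-distribution calibration.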