Confidence Calibration under Ambiguous Ground Truth
arXiv cs.LG / 3/25/2026
Key Points
- Confidence calibration can break down when multiple annotators genuinely disagree, because conventional post-hoc calibrators are typically trained against majority-voted single-label targets.
- The authors identify a structural bias in Temperature Scaling under ambiguous ground truth: the learned temperature underestimates annotator uncertainty, and miscalibration grows as annotation entropy increases.
- They propose ambiguity-aware, post-hoc calibration methods that optimize scoring rules over the full annotator label distribution without requiring model retraining.
- Dirichlet-Soft (which uses full annotator distributions) delivers the best overall calibration quality; MCTS Temperature Scaling can match full-distribution calibration with only one annotation per example; and LS-TS can improve calibration from voted labels alone via data-driven pseudo-soft targets.
- Experiments on four multi-annotator and synthetic, clinically informed benchmarks show large ECE reductions over standard Temperature Scaling: Dirichlet-Soft achieves 55–87% lower true-label ECE, and LS-TS achieves 9–77% lower ECE without any annotator data.
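The core contrast in the key points can be illustrated with a minimal numerical sketch (this is not the authors' code; the Dirichlet-simulated annotator distributions, the grid search, and all function names are illustrative assumptions). It fits a post-hoc temperature against hard majority-vote labels versus against the full annotator label distribution, and shows how the hard-label fit tends to pick a sharper (lower) temperature that understates annotator uncertainty:

```python
# Minimal sketch of hard-label vs. soft-label temperature scaling.
# All data is synthetic and illustrative, not from the paper's benchmarks.
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / t
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, targets, t):
    """Mean cross-entropy of temperature-scaled probabilities against
    targets, which may be one-hot (hard) or a distribution (soft)."""
    p = softmax(logits, t)
    return -np.mean(np.sum(targets * np.log(p + 1e-12), axis=1))

def fit_temperature(logits, targets, grid=np.linspace(0.25, 5.0, 200)):
    """Grid-search the scalar temperature minimizing cross-entropy
    (a stand-in for the usual 1-D optimization in temperature scaling)."""
    losses = [cross_entropy(logits, targets, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

def ece(probs, labels, n_bins=15):
    """Standard binned Expected Calibration Error on top-1 confidence."""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    acc = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            err += m.mean() * abs(acc[m].mean() - conf[m].mean())
    return err

rng = np.random.default_rng(0)
n, k = 2000, 3
# Annotator label distributions with genuine disagreement:
soft = rng.dirichlet(np.ones(k) * 2.0, size=n)
hard = np.eye(k)[soft.argmax(axis=1)]                  # majority-vote one-hot
# Overconfident model logits correlated with the annotator distribution:
logits = 2.5 * np.log(soft + 1e-12) + rng.normal(scale=0.75, size=(n, k))

t_hard = fit_temperature(logits, hard)                 # voted-label calibration
t_soft = fit_temperature(logits, soft)                 # full-distribution calibration
# True labels sampled from the annotator distributions, for true-label ECE:
true_y = (rng.random((n, 1)) < soft.cumsum(axis=1)).argmax(axis=1)
ece_hard = ece(softmax(logits, t_hard), true_y)
ece_soft = ece(softmax(logits, t_soft), true_y)
```

In this toy setup the hard-label fit yields a lower temperature than the soft-label fit (`t_hard < t_soft`), i.e. it keeps the model sharper than the annotator distribution warrants, which is the underestimation bias the second key point describes.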