The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

arXiv cs.LG / 4/21/2026


Key Points

  • The paper identifies a “Scaling Law of Miscalibration” in on-policy distillation, where models improve task accuracy but become systematically overconfident.
  • The root cause is framed as an information mismatch: teachers provide supervision using privileged context available during training, while deployed models must produce confidence from deployment-time information.
  • The authors formalize how teacher-conditioned success fails as a target for deployment-time confidence, and show that privileged context can lead to entropy collapse and optimism bias.
  • To fix this, they introduce CaOPD (calibration-aware on-policy distillation), which estimates empirical confidence via model rollouts and uses student-grounded confidence targets for distillation.
  • Experiments across models and domains indicate CaOPD achieves Pareto-optimal calibration while preserving competitive capability, generalizing robustly in out-of-distribution and continual-learning settings.
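The core of CaOPD's fix, per the key points above, is to replace a model's self-reported confidence with an empirical estimate computed from its own rollouts. A minimal sketch of that estimation step, with a hypothetical `verifier` callable and toy answers (none of these names come from the paper):

```python
def empirical_confidence(rollouts, verifier):
    """Student-grounded confidence: the fraction of sampled rollouts
    that a verifier judges correct. This is a sketch of the rollout-based
    confidence target described in the paper, not its exact implementation."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    correct = sum(1 for r in rollouts if verifier(r))
    return correct / len(rollouts)

# Toy usage: four sampled answers, a hypothetical exact-match verifier.
samples = ["42", "42", "41", "42"]
conf = empirical_confidence(samples, lambda ans: ans == "42")  # 0.75
```

The design point is that the target is grounded in what the deployed student can actually do from deployment-time information, rather than in success rates conditioned on the teacher's privileged context.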

Abstract

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, generalizing robustly in out-of-distribution and continual-learning settings. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
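The overconfidence the abstract describes is typically quantified with expected calibration error (ECE): bin predictions by stated confidence and average the gap between confidence and accuracy per bin. A minimal sketch of the standard binned estimator (the paper's exact metric and binning choices are assumptions here):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |avg confidence - accuracy| per bin.
    `confidences` are floats in [0, 1]; `correct` are 0/1 outcomes."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: always reports 0.9 but is right only half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])  # 0.4
```

Under this metric, a model whose stated confidence matches its empirical success rate scores near zero, which is what a student-grounded confidence target aims to achieve.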