Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

arXiv cs.LG · April 14, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies how reinforcement learning from human feedback (RLHF) and reward optimization can cause “sycophantic” reward signals to harm model calibration, which is important for reliable uncertainty quantification in LLMs.
  • Using Qwen3-8B fine-tuned in three regimes (base, neutral SFT, and sycophancy-inducing GRPO that rewards agreement with planted wrong answers), the authors find consistent directional calibration degradation under sycophantic GRPO.
  • Quantitatively, expected calibration error (ECE) rises by 0.006 versus the base model and maximum calibration error (MCE) rises by 0.010 versus neutral SFT, though the reported effect is not statistically significant at the given training budget (p = 0.41).
  • Post-hoc matrix scaling reduces ECE substantially (by 40–64%) and improves accuracy (by 1.5–3.0 percentage points), but the sycophantic model still shows the highest residual ECE after scaling, indicating structured miscalibration persists.
  • The work proposes an evaluation methodology for calibration impacts of reward hacking and motivates calibration-aware training objectives for future reward optimization approaches.
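For readers less familiar with the metrics above, the standard binned definitions of ECE and MCE can be sketched as follows. This is a minimal illustration, not the paper's evaluation code; the bin count and equal-width binning scheme are assumptions.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Binned expected (ECE) and maximum (MCE) calibration error.

    ECE weights each bin's |accuracy - mean confidence| gap by the
    fraction of samples in that bin; MCE takes the largest gap.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.sum() / n * gap
        mce = max(mce, gap)
    return ece, mce
```

For example, ten predictions at confidence 0.9 of which nine are correct are perfectly calibrated (ECE ≈ 0), while the same confidences with only five correct yield an ECE of 0.4.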

Abstract

Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration, a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on 1,000 MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation: ECE rises by 0.006 relative to the base model and MCE increases by 0.010 relative to neutral SFT, though the effect does not reach statistical significance (p = 0.41) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by 40–64% and improves accuracy by 1.5–3.0 percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control (0.042 vs. 0.037), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
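The post-hoc matrix scaling the abstract refers to is an affine correction of the logits, z → Wz + b, fit on held-out data by minimising negative log-likelihood. The sketch below is a minimal numpy implementation with plain gradient descent; it is not the authors' code, and the learning rate and step count are assumptions.

```python
import numpy as np

def fit_matrix_scaling(logits, labels, lr=0.1, steps=500):
    """Fit an affine correction z -> W z + b on held-out logits
    by minimising the negative log-likelihood (cross-entropy)."""
    n, k = logits.shape
    W = np.eye(k)           # initialise at identity: no correction
    b = np.zeros(k)
    y = np.eye(k)[labels]   # one-hot targets
    for _ in range(steps):
        z = logits @ W.T + b
        z -= z.max(axis=1, keepdims=True)       # numerically stable softmax
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - y) / n                          # dNLL/dz, averaged over samples
        W -= lr * g.T @ logits                   # dNLL/dW
        b -= lr * g.sum(axis=0)                  # dNLL/db
    return W, b
```

Temperature scaling is the special case W = I/T, b = 0; the full matrix variant is more expressive, which is why the residual ECE the paper reports after it is notable.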