Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

arXiv cs.LG · April 14, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies how reinforcement learning from human feedback (RLHF) and reward optimization can cause “sycophantic” reward signals to harm model calibration, which is important for reliable uncertainty quantification in LLMs.
  • Using Qwen3-8B fine-tuned in three regimes (base, neutral SFT, and sycophancy-inducing GRPO that rewards agreement with planted wrong answers), the authors find consistent directional calibration degradation under sycophantic GRPO.
  • Quantitatively, expected calibration error (ECE) rises by 0.006 versus the base model and maximum calibration error (MCE) rises by 0.010 versus neutral SFT, though the reported effect is not statistically significant at the given training budget (p = 0.41).
  • Post-hoc matrix scaling reduces ECE substantially (by 40–64%) and improves accuracy (by 1.5–3.0 percentage points), but the sycophantic model still shows the highest residual ECE after scaling, indicating structured miscalibration persists.
  • The work proposes an evaluation methodology for calibration impacts of reward hacking and motivates calibration-aware training objectives for future reward optimization approaches.
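For readers less familiar with the metrics above, the standard binned definitions of ECE and MCE can be sketched as follows. This is a minimal illustration, not the paper's evaluation code; the bin count and equal-width binning scheme are assumptions.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Binned expected (ECE) and maximum (MCE) calibration error.

    ECE weights each bin's |accuracy - mean confidence| gap by the
    fraction of samples in that bin; MCE takes the largest gap.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.sum() / n * gap
        mce = max(mce, gap)
    return ece, mce
```

For example, ten predictions at confidence 0.9 of which nine are correct are perfectly calibrated (ECE ≈ 0), while the same confidences with only five correct yield an ECE of 0.4.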

Abstract

Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration, a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on 1,000 MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation: ECE rises by 0.006 relative to the base model and MCE increases by 0.010 relative to neutral SFT, though the effect does not reach statistical significance (p = 0.41) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by 40–64% and improves accuracy by 1.5–3.0 percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control (0.042 vs. 0.037), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
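The post-hoc matrix scaling the abstract refers to is an affine correction of the logits, z → Wz + b, fit on held-out data by minimising negative log-likelihood. The sketch below is a minimal numpy implementation with plain gradient descent; it is not the authors' code, and the learning rate and step count are assumptions.

```python
import numpy as np

def fit_matrix_scaling(logits, labels, lr=0.1, steps=500):
    """Fit an affine correction z -> W z + b on held-out logits
    by minimising the negative log-likelihood (cross-entropy)."""
    n, k = logits.shape
    W = np.eye(k)           # initialise at identity: no correction
    b = np.zeros(k)
    y = np.eye(k)[labels]   # one-hot targets
    for _ in range(steps):
        z = logits @ W.T + b
        z -= z.max(axis=1, keepdims=True)       # numerically stable softmax
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - y) / n                          # dNLL/dz, averaged over samples
        W -= lr * g.T @ logits                   # dNLL/dW
        b -= lr * g.sum(axis=0)                  # dNLL/db
    return W, b
```

Temperature scaling is the special case W = I/T, b = 0; the full matrix variant is more expressive, which is why the residual ECE the paper reports after it is notable.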