Abstract
Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward-optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration, a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on 1{,}000 MMLU items across five subject domains, with bootstrap confidence intervals and permutation testing, we find that \textbf{sycophantic GRPO produces a consistent directional degradation in calibration}: expected calibration error (ECE) rises by +0.006 relative to the base model and maximum calibration error (MCE) by +0.010 relative to neutral SFT, although the effect does not reach statistical significance (p = 0.41) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by 40--64\% and improves accuracy by 1.5--3.0 percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control (0.042 vs.\ 0.037), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
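The two calibration metrics reported above are the standard equal-width-binned errors: ECE is the sample-weighted mean gap between accuracy and confidence per bin, and MCE is the maximum such gap. A minimal sketch, assuming 15 half-open bins; the function name and binning scheme are illustrative, not necessarily the paper's exact implementation:

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=15):
    """Binned ECE (weighted mean gap) and MCE (max gap) over equal-width bins.

    confidences: array of predicted probabilities in (0, 1].
    correct: array of 0/1 indicators of whether each prediction was right.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; samples at exactly 0 confidence are skipped.
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap  # weight by bin occupancy
        mce = max(mce, gap)
    return ece, mce
```

For example, a model that answers correctly 60\% of the time at confidence 0.6 contributes zero gap in that bin, while systematic overconfidence in any bin inflates both metrics.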