Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

arXiv cs.LG / 4/6/2026


Key Points

  • The paper addresses reward hacking in RLHF, where optimizing a policy against a learned reward model can cause true response quality to plateau or degrade.
  • It argues that a key failure mode is “flipped advantage signs,” where an incorrect sign makes policy updates increase the likelihood of bad responses.
  • By applying adversarial perturbations in the reward model parameter space, the authors derive a certified sign-preservation radius indicating the minimum perturbation needed to flip the advantage sign.
  • They introduce Sign-Certified Policy Optimization (SignCert-PO), which down-weights policy-gradient contributions from non-robust (sign-unstable) completions.
  • Experiments on TL;DR summarization and AlpacaFarm benchmarks show improved win rates over baselines and reduced reward hacking, with the method requiring only the RM parameters and on-policy completions at optimization time.
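To make the "flipped advantage sign" failure mode in the key points concrete, here is a minimal sketch using a toy softmax policy over three candidate responses and a single REINFORCE-style update. The toy policy, the learning rate, and the names `softmax` and `policy_gradient_step` are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, action, advantage, lr=0.5):
    # One REINFORCE-style ascent step for a softmax policy:
    # d log pi(action) / d logit_k = 1[k == action] - pi_k.
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0, 0.0]   # uniform toy policy over three responses
bad = 2                    # index of a response whose true quality is poor

p_before = softmax(logits)[bad]
# Correct (negative) advantage: the bad response becomes less likely.
p_correct = softmax(policy_gradient_step(logits, bad, advantage=-1.0))[bad]
# Flipped (positive) advantage: the same update makes it more likely.
p_flipped = softmax(policy_gradient_step(logits, bad, advantage=+1.0))[bad]

print(p_before, p_correct, p_flipped)
```

The update rule is the same in both cases; only the sign of the advantage differs, which is why an error in the sign is more damaging than an error in the magnitude.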

Abstract

Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
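The abstract does not give the formula for the certified radius, so the following is only a sketch of the idea under strong simplifying assumptions: the reward is a linear head `w · phi(x, y)` on fixed features, only `w` is perturbed (in L2 norm), and the advantage is the reward minus the group mean over sampled completions. Under those assumptions the advantage is `w · d` with `d = phi_i - mean(phi)`, and the smallest perturbation of `w` that drives it to zero is `|w · d| / ||d||`. The function names, the threshold `eps`, and the weighting rule `min(1, radius / eps)` are hypothetical and may differ from what SignCert-PO actually uses.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign_radii(w, features):
    # Assumed setup: linear reward head r_i = w . phi_i and a group-mean
    # baseline, so the advantage is A_i = w . (phi_i - mean(phi)).
    n, dim = len(features), len(w)
    mean_feat = [sum(f[j] for f in features) / n for j in range(dim)]
    advantages, radii = [], []
    for f in features:
        d = [fj - mj for fj, mj in zip(f, mean_feat)]
        adv = dot(w, d)
        norm = math.sqrt(dot(d, d))
        # Distance from w to the hyperplane {w' : w' . d = 0}: the smallest
        # L2 perturbation of w that can change the sign of the advantage.
        radii.append(abs(adv) / norm if norm > 0 else float("inf"))
        advantages.append(adv)
    return advantages, radii

def robustness_weights(radii, eps):
    # Hypothetical down-weighting rule: full weight once the radius
    # reaches eps, linearly smaller weight below it.
    return [min(1.0, r / eps) for r in radii]

w = [1.0, 0.0]                                   # toy reward head
features = [[2.0, 0.0], [0.0, 0.0], [1.0, 3.0]]  # toy completion features
advantages, radii = sign_radii(w, features)
weights = robustness_weights(radii, eps=1.0)

# Weighted advantages that would multiply each completion's log-prob gradient.
weighted = [a * wt for a, wt in zip(advantages, weights)]
print(advantages, radii, weights, weighted)
```

In this toy example the third completion has an advantage of exactly zero, so its sign is maximally fragile and it receives zero weight, while the other two are scaled by how far the reward head would have to move before their signs flipped.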