Robust Optimization for Mitigating Reward Hacking with Correlated Proxies

arXiv cs.LG · April 15, 2026


Key Points

  • The paper addresses reward hacking in reinforcement learning by designing robust agents trained with imperfect proxy rewards rather than assuming proxies perfectly match the true objective.
  • It reframes reward hacking as a robust policy optimization problem over all proxy rewards that satisfy an r-correlation constraint with the true reward, leading to a tractable max-min formulation against worst-case correlated proxies.
  • For cases where rewards are linear in known features, the method is extended to leverage that prior structure, producing improved policies and interpretable worst-case rewards.
  • Experiments across multiple environments show the proposed algorithms outperform ORPO in worst-case returns and offer improved robustness and stability as the proxy–true reward correlation varies.
  • The authors release their code publicly, enabling researchers to reproduce the results and build on the approach.
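The max-min formulation described above can be written schematically as follows. The notation here is an illustrative reconstruction from the summary, not the paper's exact statement: $\hat r$ is the observed proxy, $J_{\tilde r}(\pi)$ is the expected return of policy $\pi$ under reward $\tilde r$, and the inner minimization ranges over all rewards satisfying the correlation constraint with the proxy:

$$
\max_{\pi} \; \min_{\tilde r \,:\, \operatorname{corr}(\tilde r, \hat r) \,\ge\, r} \; J_{\tilde r}(\pi)
$$

That is, the agent maximizes its return under the worst-case reward consistent with being $r$-correlated with the proxy, rather than trusting the proxy itself.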

Abstract

Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns, and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at https://github.com/ZixuanLiu4869/reward_hacking.
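To make the max-min idea concrete, here is a minimal, self-contained sketch in a one-step linear-reward setting. This is not the paper's algorithm (see the linked repository for that); the L2 ball around the proxy weights is a simplified stand-in for the r-correlation constraint, and `phi`, `w_proxy`, and the alternating update scheme are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 5, 3
phi = rng.normal(size=(n_actions, n_features))   # known feature map (assumed)
w_proxy = np.array([1.0, 0.5, -0.2])             # proxy reward weights (assumed)

def worst_case_w(policy, w_proxy, radius=0.5):
    """Worst-case linear reward weights within an L2 ball around the
    proxy weights -- a simplified stand-in for the correlation constraint."""
    mu = policy @ phi                             # expected features under the policy
    direction = mu / (np.linalg.norm(mu) + 1e-12)
    # The adversary pushes the weights against the policy's feature expectation.
    return w_proxy - radius * direction

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Alternate: adversary picks the worst-case reward for the current policy,
# then the policy soft-best-responds to that reward.
policy = np.ones(n_actions) / n_actions
for _ in range(100):
    w = worst_case_w(policy, w_proxy)
    policy = softmax(5.0 * (phi @ w))

worst_value = policy @ phi @ worst_case_w(policy, w_proxy)
proxy_value = policy @ phi @ w_proxy
```

By construction the worst-case value is never higher than the proxy value for the same policy, so a policy trained this way hedges against reward misspecification instead of exploiting the proxy.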