Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

arXiv cs.AI / 4/30/2026


Key Points

  • The paper addresses a common reinforcement learning alignment issue where optimizing scalar rewards under uncertain, inconsistent real-world objectives can cause reward hacking and overconfident behavior.
  • It proposes a dual-source uncertainty-aware reward framework that models both epistemic uncertainty (via ensemble disagreement in value estimates) and preference uncertainty (via variability in reward annotations).
  • The method combines these uncertainty signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, balancing exploitation with caution under ambiguity (a hedged sketch of both ideas follows this list).
  • Experiments on discrete gridworlds and continuous control tasks (Hopper-v4, Walker2d-v4) show substantially reduced reward-hacking behavior, including a reported 93.7% reduction in trap-visitation frequency, with robustness under up to 30% supervisory noise.
  • The improvements come with a trade-off: peak observed reward is reduced compared with unconstrained baselines, reflecting the cost of added safety through uncertainty handling.
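
The key points above describe the two uncertainty sources only at a high level. The sketch below shows one plausible way to estimate them and fold them into a single confidence score, assuming an ensemble of value estimates and several reward annotations per outcome are available; the function names, equal weights, and the 1/(1 + u) squashing are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def epistemic_uncertainty(ensemble_values):
    """Epistemic uncertainty as disagreement (standard deviation) across an
    ensemble of value estimates for the same state-action pair."""
    return np.std(ensemble_values, axis=0)

def preference_uncertainty(reward_annotations):
    """Preference uncertainty as variability across human reward annotations
    of the same outcome."""
    return np.std(reward_annotations, axis=0)

def confidence_score(eps_u, pref_u, w_eps=0.5, w_pref=0.5):
    """Combine both uncertainty sources into a confidence value in (0, 1].
    Equal weights and the 1/(1 + u) squashing are illustrative choices."""
    combined = w_eps * eps_u + w_pref * pref_u
    return 1.0 / (1.0 + combined)  # high uncertainty -> low confidence

# Toy usage: 5 ensemble members, 4 annotators, 3 candidate actions.
q_ensemble = np.random.randn(5, 3)
annotations = np.random.randn(4, 3)
conf = confidence_score(epistemic_uncertainty(q_ensemble),
                        preference_uncertainty(annotations))
```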

Abstract

Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives, especially those derived from human preferences, are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and caution. Empirical results across multiple discrete grid configurations (6×6, 8×8, 10×10) and high-dimensional continuous control environments (Hopper-v4, Walker2d-v4) demonstrate that our approach yields more stable training dynamics and reduces exploitative behaviors under reward ambiguity, achieving a 93.7% reduction in reward-hacking behavior as measured by trap visitation frequency. We demonstrate statistical significance of these improvements and robustness under up to 30% supervisory noise, albeit with a trade-off in peak observed reward compared to unconstrained baselines. By treating uncertainty as a first-class component of the reward signal, this work offers a principled approach toward more reliable and aligned reinforcement learning systems.
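
The title's "reward discounting" and the abstract's confidence-adjusted Reliability Filter suggest two uses for the combined confidence score: attenuating the reward signal itself and tempering action selection. The snippet below illustrates both in their simplest form, building on the `confidence_score` sketch above; the multiplicative reward discount and the pessimistic selection penalty (with its `caution` knob) are assumptions for illustration, not the paper's published rule.

```python
import numpy as np

def discounted_reward(reward, confidence):
    """Scale the scalar reward by its confidence so that learning leans less
    on poorly supported signals (uncertainty-aware discounting in its
    simplest multiplicative form)."""
    return confidence * reward

def reliability_filtered_action(q_ensemble, confidence, caution=1.0):
    """Select an action by penalizing each action's mean ensemble value in
    proportion to how little confidence it carries, a pessimistic,
    lower-bound-style adjustment.

    q_ensemble : shape (n_models, n_actions), ensemble Q-estimates
    confidence : shape (n_actions,), values in (0, 1]
    caution    : illustrative hyperparameter scaling the penalty
    """
    mean_q = q_ensemble.mean(axis=0)
    penalty = caution * (1.0 - confidence)  # grows as confidence falls
    return int(np.argmax(mean_q - penalty))
```

Raising `caution` makes the agent prefer well-supported actions even when a less certain alternative looks more rewarding, which is the exploitation-versus-caution trade-off the abstract describes.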