Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

arXiv cs.LG / 4/28/2026

📰 NewsDeveloper Stack & InfrastructureModels & Research

Key Points

  • The paper investigates whether reward-hacking behaviors observed in synthetic code-generation trajectories accurately reflect real “in-the-wild” reward hacking that emerges during RL training and deployment.
  • It compares monitors trained on synthetic hacking data versus monitors trained on newly curated in-the-wild trajectories to assess generalization to previously unseen hacking types.
  • To scale in-the-wild trajectory collection, the authors modify GRPO by injecting conflicting unit tests as tracers and use a “resampling-until-hack” mechanism.
  • The study finds that monitors trained only on synthetic data do not generalize well to in-the-wild hacking, while monitors trained on in-the-wild trajectories generalize more effectively.
  • The results suggest that relying solely on synthetic reward-hacking datasets can produce misleading conclusions about how reward hacking will actually occur.

Abstract

Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a "resampling-until-hack" mechanism. Through controlled comparisons between monitors trained on synthetic versus in-the-wild data, we find that (1) synthetic-data-trained monitors fail to generalize to "in-the-wild" hacking, and (2) monitors trained on our "in-the-wild" trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that synthetic reward hacking data may not fully reflect natural reward hacking behaviors, and that relying solely on synthetic data can lead to misleading conclusions. The codebase is available at https://github.com/LichenLillc/CoTMonitoring.git