Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
arXiv cs.LG / 4/6/2026
Key Points
- The paper highlights that reward models used in RLHF are vulnerable to reward hacking, with prior attacks largely manipulating outputs in the semantic (human-readable) text space.
- It introduces Token Mapping Perturbation Attack (TOMPA), which performs adversarial optimization directly in token space to bypass the usual decode→re-tokenize step between policy and reward model.
- TOMPA uses only black-box scalar reward feedback to automatically find non-linguistic token patterns that trigger very high reward-model (RM) scores across multiple state-of-the-art RMs (a sketch of this search loop follows the list).
- When targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers, exceeds them on 98% of prompts, and does so while producing degenerate, nonsensical text.
- The results suggest a critical vulnerability in current RLHF pipelines: reward models can be systematically exploited beyond the semantic regime, indicating limitations of semantic-only defenses.
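
The paper's exact optimizer is not detailed in this summary, but the loop it describes, a greedy black-box search directly over token IDs guided only by the RM's scalar output, can be sketched roughly as below. The Hugging Face loading pattern, the repo id, the hyperparameters, and the omission of chat-template formatting are all illustrative assumptions, not TOMPA itself.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# RM named in the paper; the exact HF repo id is an assumption here, and
# chat-template formatting of prompt/response is omitted for brevity.
MODEL = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(MODEL)
rm = AutoModelForSequenceClassification.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

@torch.no_grad()
def reward(ids: torch.Tensor) -> float:
    """Black-box scalar feedback: score a raw token-ID sequence."""
    return rm(input_ids=ids.unsqueeze(0).to(rm.device)).logits.item()

def token_space_search(seed_text: str, steps: int = 500) -> torch.Tensor:
    """Greedy random search directly over token IDs (no text round trip)."""
    ids = tok(seed_text, return_tensors="pt").input_ids[0]
    best = reward(ids)
    for _ in range(steps):
        cand = ids.clone()
        pos = torch.randint(len(cand), (1,)).item()             # pick a position
        cand[pos] = torch.randint(tok.vocab_size, (1,)).item()  # any token at all
        score = reward(cand)
        if score > best:  # keep only mutations that raise the scalar reward
            ids, best = cand, score
    return ids  # often decodes to degenerate, non-linguistic text
```

Because every candidate is mutated and scored as raw token IDs, nothing here has to survive a decode→re-tokenize round trip, so the search can wander into non-linguistic token patterns that a text-emitting policy could never represent, which is exactly the regime where the paper reports inflated scores.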