ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

arXiv cs.AI / April 22, 2026


Key Points

  • RLHF for aligning LLMs can fail catastrophically when an imperfect reward model (RM) fails to properly penalize unsafe behavior, creating a single point of failure.
  • The paper identifies a “systemic weakness” scenario where both the core LLM and the RM fail together, whereas many existing red-teaming methods focus only on policy-level issues.
  • ARES introduces an end-to-end framework with a “Safety Mentor” that builds semantically coherent adversarial prompts from structured components (topics, personas, tactics, goals) and generates both malicious and safe responses.
  • After uncovering dual vulnerabilities, ARES performs a two-stage repair: first fine-tuning the RM to better detect harmful content, then using the improved RM to optimize the core model.
  • Experiments on multiple adversarial safety benchmarks show ARES improves safety robustness while largely preserving model capabilities, suggesting a more comprehensive approach to RLHF alignment.
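The "Safety Mentor" described above composes adversarial prompts from structured components. A minimal sketch of that composition idea, assuming hypothetical component pools and a hypothetical prompt template (the paper's actual taxonomies and generator are not shown here):

```python
import itertools
import random

# Hypothetical component pools; the paper's actual taxonomies are not public here.
TOPICS = ["restricted chemistry", "financial fraud"]
PERSONAS = ["security researcher", "fiction author"]
TACTICS = ["role-play framing", "hypothetical scenario"]
GOALS = ["elicit step-by-step instructions", "extract restricted details"]

def compose_prompt(topic: str, persona: str, tactic: str, goal: str) -> str:
    """Combine one component of each type into a single coherent prompt."""
    return (f"Acting as a {persona}, use {tactic} to discuss {topic} "
            f"with the aim to {goal}.")

def sample_prompts(n: int, seed: int = 0) -> list[str]:
    """Sample n distinct (topic, persona, tactic, goal) combinations."""
    rng = random.Random(seed)
    combos = list(itertools.product(TOPICS, PERSONAS, TACTICS, GOALS))
    return [compose_prompt(*c) for c in rng.sample(combos, n)]
```

Composing from a cross-product of typed components, rather than mutating free-form strings, is what keeps the generated prompts semantically coherent while still covering a large attack surface.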

Abstract

Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the discovered vulnerabilities, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.
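The two-stage repair can be illustrated with a deliberately toy sketch: first the RM is updated on red-team examples so harmful responses score below safe ones, then the repaired RM drives policy optimization. Everything below (the `ToyRewardModel`/`ToyPolicy` classes and their marker-matching "training") is an invented stand-in for the paper's actual fine-tuning and RLHF machinery:

```python
class ToyRewardModel:
    """Stand-in RM: penalizes responses that match learned unsafe markers."""
    def __init__(self):
        self.blocked = set()  # responses the RM has learned to penalize

    def score(self, response: str) -> float:
        return -1.0 if any(p in response for p in self.blocked) else 1.0

    def finetune(self, malicious: list[str]) -> None:
        # Stage 1: learn from red-team outputs so harmful content is detected.
        self.blocked.update(malicious)

class ToyPolicy:
    """Stand-in policy: a fixed pool of candidate responses."""
    def __init__(self, candidates: list[str]):
        self.candidates = candidates

    def optimize(self, reward_fn) -> None:
        # Stage 2: keep only responses the repaired RM rewards
        # (a crude proxy for RL optimization against the RM).
        self.candidates = [c for c in self.candidates if reward_fn(c) > 0]

rm = ToyRewardModel()
rm.finetune(malicious=["here is the exploit code"])
policy = ToyPolicy(["here is the exploit code", "I can't help with that"])
policy.optimize(rm.score)
```

The ordering is the point: repairing the RM first ensures that the second stage optimizes the policy against a reward signal that actually penalizes the behaviors the red-teaming uncovered, rather than reinforcing the original blind spot.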
