Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

arXiv cs.AI / 4/14/2026

Key Points

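Agent^2 RL-Bench is a new benchmark for evaluating whether LLM agents can autonomously design, implement, and run a complete RL pipeline for agentic RL post-training.
The benchmark comprises six tasks across three levels, progressing from static rule-based training to closed-loop online RL with trajectory collection; each level adds a structural requirement that prior levels do not impose.
Isolated workspaces, a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured reports together enable automated diagnosis of agent-driven post-training behavior.
Experiments across multiple agent stacks and six driver LLMs show strong task dependence: on ALFWorld, SFT warm-up plus GRPO with online rollouts yields a large improvement, whereas DeepSearchQA shows almost none. Even within the same scaffold, the choice of driver substantially changes the size of the online improvement.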

Abstract

We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training -- whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels -- from static rule-based training to closed-loop online RL with trajectory collection -- each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains -- on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts -- yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks -- within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench.
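
The abstract names GRPO with online rollouts as the route that succeeded on ALFWorld. As a rough illustration of the core idea behind GRPO, the sketch below computes group-relative advantages: each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is not code from the benchmark; the function name, the example rewards, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper, not part of Agent^2 RL-Bench. Uses the population
    standard deviation; some implementations use the sample version instead.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four rollouts of the same task (1.0 = task solved).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update favors trajectories that beat their group's average. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.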