RAGEN-2: Reasoning Collapse in Agentic RL

arXiv cs.LG / 4/9/2026


Key Points

  • RAGEN-2 identifies a new failure mode in multi-turn LLM agent reinforcement learning—"template collapse"—where models produce seemingly diverse reasoning that is actually input-agnostic, evading detection by entropy-based metrics.
  • The paper proposes decomposing reasoning quality into within-input diversity (measured by entropy) and cross-input distinguishability (measured via mutual information), showing that mutual information correlates more strongly with final task performance than entropy across multiple tasks.
  • To enable online diagnosis, RAGEN-2 introduces mutual-information proxy metrics and demonstrates their effectiveness in detecting when reasoning stops responding to different inputs.
  • The authors explain template collapse via a signal-to-noise ratio (SNR) mechanism: when reward variance is low, the useful task gradient weakens, so regularization terms dominate the update and erase cross-input differences in reasoning.
  • As a mitigation, the paper introduces SNR-Aware Filtering that selects high-signal prompts per iteration using reward variance, improving both input dependence and performance in planning, math, web navigation, and code execution.
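The filtering idea in the last point can be sketched compactly. The snippet below is a minimal illustration, not the paper's implementation: it assumes each prompt is rolled out several times per iteration, uses the variance of those rollout rewards as the lightweight SNR proxy, and keeps only the highest-variance prompts (the function name and `keep_fraction` parameter are hypothetical).

```python
import numpy as np

def snr_aware_filter(prompt_rewards: dict[str, list[float]],
                     keep_fraction: float = 0.5) -> list[str]:
    """Keep the prompts whose sampled-rollout rewards vary the most.

    prompt_rewards maps each prompt id to the rewards of the rollouts
    sampled for it this iteration. Reward variance serves as a cheap
    proxy for gradient signal strength (SNR): prompts the policy always
    solves, or never solves, contribute near-zero task gradient.
    """
    variances = {p: float(np.var(r)) for p, r in prompt_rewards.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Toy batch: only the prompt with mixed outcomes carries signal.
batch = {
    "p1": [1.0, 1.0, 1.0, 1.0],  # solved every rollout -> zero variance
    "p2": [0.0, 1.0, 0.0, 1.0],  # mixed outcomes -> high variance
    "p3": [0.0, 0.0, 0.0, 0.0],  # never solved -> zero variance
}
print(snr_aware_filter(batch, keep_fraction=0.34))  # ['p2']
```

Under this proxy, saturated prompts (all-success or all-failure) are dropped, which matches the paper's claim that low reward variance lets regularization dominate the update.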

Abstract

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
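The entropy/MI decomposition can be made concrete with discrete quantities. The sketch below is an illustrative toy, not the paper's proxy metrics: it assumes reasoning traces have already been bucketed into discrete template labels (e.g. by clustering), then computes mean within-input entropy H(T|X) and mutual information I(X;T) = H(T) - H(T|X). Template collapse is the case where within-input entropy stays high while I(X;T) drops to zero.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def diagnose(traces_by_input):
    """Return (within-input entropy H(T|X), mutual information I(X;T)).

    traces_by_input: {input_id: [template_label, ...]} from sampled
    rollouts; labels are an assumed discretization of reasoning traces.
    """
    all_labels = [t for ts in traces_by_input.values() for t in ts]
    n = len(all_labels)
    h_marginal = entropy(all_labels)                                  # H(T)
    h_cond = sum(len(ts) / n * entropy(ts)                            # H(T|X)
                 for ts in traces_by_input.values())
    return h_cond, h_marginal - h_cond                                # I = H(T) - H(T|X)

# Template collapse: every input samples the same two templates,
# so traces look diverse but ignore the input entirely.
collapsed = {"x1": ["A", "B", "A", "B"], "x2": ["A", "B", "B", "A"]}
# Healthy reasoning: the template distribution shifts with the input.
healthy = {"x1": ["A", "A", "B", "A"], "x2": ["C", "C", "D", "C"]}
print(diagnose(collapsed))  # H(T|X) = 1.0 bit, I(X;T) = 0.0
print(diagnose(healthy))    # similar H(T|X), but I(X;T) = 1.0 bit
```

The collapsed batch keeps entropy at a healthy-looking 1.0 bit while MI is exactly zero, which is precisely the failure mode the abstract says entropy alone cannot see.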