Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

arXiv cs.LG / 4/17/2026

Key Points

The paper asks whether reinforcement learning (RL) truly expands the capability boundary of LLM agents or only improves reliability, extending prior “pass@k convergence” results from static reasoning to agentic tool use.
It introduces a new metric, PASS@(k,T), that jointly evaluates the sampling budget (k) and the interaction depth (T) to disentangle capability gains from efficiency gains.
The authors find that RL expands the capability boundary for tool-using agents: the RL pass curve rises above the base model’s, and the gap grows at larger k rather than converging.
This capability expansion is most pronounced on compositional, sequential information-gathering tasks. On simpler tasks RL behaves as earlier work predicts: the base and RL pass@k curves converge at large k.
With matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, which the authors take as isolating self-directed exploration as the causal factor. Mechanism analysis suggests RL reweights the base strategy distribution toward strategies whose downstream reasoning more often yields a correct answer, with the gain concentrated in how the agent integrates retrieved information.

Abstract

Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.
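
This summary does not give the paper's exact definition of PASS@(k,T), so the following is only a minimal sketch of how such a two-dimensional metric might be estimated. It assumes the standard unbiased pass@k estimator (n samples per task, c of them correct) and treats a rollout as successful at depth T if it solved the task within T interaction rounds. The names `pass_at_k`, `pass_at_k_t`, and the `(solved, rounds_used)` rollout format are hypothetical, chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n is among the c correct ones."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_t(rollouts: dict, k: int, t: int) -> float:
    """Hypothetical PASS@(k,T) estimate averaged over tasks.

    rollouts maps a task id to a list of (solved, rounds_used) pairs.
    Assumption: a rollout counts as correct at depth t only if it solved
    the task using at most t interaction rounds.
    """
    per_task = []
    for samples in rollouts.values():
        n = len(samples)
        c = sum(1 for solved, rounds in samples if solved and rounds <= t)
        per_task.append(pass_at_k(n, c, k))
    return sum(per_task) / len(per_task)


# Toy data (invented for illustration, not from the paper).
toy_rollouts = {
    "task_a": [(True, 1), (False, 3), (False, 3), (True, 3)],
    "task_b": [(False, 3), (False, 3), (True, 3), (False, 2)],
}

shallow = pass_at_k_t(toy_rollouts, k=2, t=1)  # only 1 round allowed
deep = pass_at_k_t(toy_rollouts, k=2, t=3)     # up to 3 rounds allowed
print(f"PASS@(2,1) = {shallow:.3f}, PASS@(2,3) = {deep:.3f}")
```

Sweeping k at a fixed T for both a base and an RL agent would give the two curves the abstract compares: convergence at large k suggests an efficiency gain, while a gap that persists or widens suggests boundary expansion. One caveat of this sketch is that filtering finished rollouts by rounds used is a simplification; an agent actually run under a cap of T rounds may behave differently.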