Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

arXiv cs.CL / 3/27/2026


Key Points

  • The paper proposes arrol, an online rollout pruning approach for Reinforcement Learning with Verifiable Rewards (RLVR) that prunes trajectories during generation to reduce the heavy compute cost of methods like GRPO and DAPO.
  • arrol trains a lightweight, on-the-fly “quality head” to predict success probability of partial rollouts and uses it to make early pruning decisions that improve the correctness balance of the remaining samples.
  • By pruning inside the inference engine and re-batching surviving rollouts for log-probability computation and policy updates, arrol boosts training efficiency while preserving or improving learning signals.
  • Experiments on GRPO and DAPO with Qwen-3 and LLaMA-3.2 models (1B–8B) show +2.30 to +2.99 average accuracy improvements and up to 1.7x training speedup, with further gains up to +8.33 from test-time scaling using the learned quality head.
  • The authors provide an open-source code release (https://github.com/Hsu1023/ARRoL) to enable adoption and further evaluation of the method.
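The pruning loop described in the key points can be sketched in a few lines. This is an illustrative Python sketch under our own assumptions, not the paper's implementation: `quality_head` here is a hypothetical stand-in for the learned head (the paper trains it on-the-fly during RLVR), and keeping the lowest- and highest-scored partial rollouts is just one plausible way to keep the surviving group correctness-balanced; the paper's actual pruning criterion may differ.

```python
def prune_rollouts(partials, quality_head, keep=4):
    """Prune partial rollouts mid-generation, keeping a correctness-balanced subset.

    partials:     list of in-progress rollouts (any type).
    quality_head: callable mapping a partial rollout to its predicted
                  probability of ending up correct (hypothetical stand-in
                  for the paper's learned quality head).
    keep:         number of rollouts that survive pruning.
    """
    scored = sorted(partials, key=quality_head)
    half = keep // 2
    # Keep the rollouts judged most likely to fail and most likely to
    # succeed, so the surviving group mixes correct and incorrect outcomes
    # and within-group reward variance (the GRPO/DAPO advantage signal)
    # stays high instead of collapsing to all-correct or all-incorrect.
    return scored[:half] + scored[-(keep - half):]


# Example: 8 partial rollouts whose predicted success scores are just r / 10.
rollouts = list(range(8))
survivors = prune_rollouts(rollouts, quality_head=lambda r: r / 10)
# survivors == [0, 1, 6, 7]: the two least- and two most-promising rollouts
```

In a real RLVR system this selection would run inside the inference engine between decoding steps, with only the survivors re-batched for log-probability computation and policy updates, as the paper describes.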

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a more correctness-balanced mix to strengthen learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B–8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yields up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.