Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR
arXiv cs.CL / 3/27/2026
Key Points
- The paper proposes ARRoL, an online rollout-pruning approach for Reinforcement Learning with Verifiable Rewards (RLVR) that prunes trajectories during generation, cutting the heavy rollout compute of methods like GRPO and DAPO.
- ARRoL trains a lightweight, on-the-fly “quality head” that predicts the success probability of partial rollouts and uses it to make early pruning decisions that improve the correctness balance of the surviving samples.
- By pruning inside the inference engine and re-batching surviving rollouts for log-probability computation and policy updates, ARRoL improves training efficiency while preserving or strengthening the learning signal.
- Experiments on GRPO and DAPO with Qwen-3 and LLaMA-3.2 models (1B–8B) show average accuracy gains of +2.30 to +2.99 points and up to a 1.7x training speedup, with further gains of up to +8.33 points from test-time scaling with the learned quality head.
- The authors provide an open-source code release (https://github.com/Hsu1023/ARRoL) to enable adoption and further evaluation of the method.
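The summary does not spell out the quality head's architecture or the exact pruning rule, so the following is only a minimal sketch under assumed details: a hypothetical logistic scorer over hand-picked features of a partial rollout stands in for the learned quality head, and the pruning rule keeps a mix of high- and low-scoring rollouts so the surviving group retains both likely-correct and likely-incorrect samples (a group with non-zero advantage for GRPO-style updates). The function names and features are illustrative, not from the paper.

```python
import math

def predicted_success(partial_tokens, weights, bias):
    """Hypothetical stand-in for the learned quality head: a logistic
    score over simple features of a partial rollout (length and mean
    token id). The real head would be trained on rollout outcomes."""
    length = len(partial_tokens)
    mean_tok = sum(partial_tokens) / max(length, 1)
    z = weights[0] * length + weights[1] * mean_tok + bias
    return 1.0 / (1.0 + math.exp(-z))

def prune_rollouts(success_probs, keep):
    """Keep `keep` rollout indices, half from the lowest predicted
    success probabilities and half from the highest, so the surviving
    group stays balanced between likely-incorrect and likely-correct
    samples instead of collapsing to one side."""
    order = sorted(range(len(success_probs)), key=lambda i: success_probs[i])
    half = keep // 2
    kept = order[:half] + order[-(keep - half):]
    return sorted(kept)
```

With predicted probabilities `[0.9, 0.1, 0.5, 0.8, 0.2, 0.7]` and `keep=4`, the rule retains the two least and two most promising rollouts, discarding the middle of the distribution; other balancing rules (e.g. keeping rollouts closest to a target correct ratio) would slot into the same interface.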