Off-Policy Value-Based Reinforcement Learning for Large Language Models
arXiv cs.LG / 3/25/2026
Key Points
- The paper argues that scaling RL for LLMs in long-horizon settings is limited by on-policy training, which generates expensive trajectories only to use them once, hurting sample efficiency.
- It proposes ReVal, a value-based RL framework built on Bellman updates, which enables off-policy learning from replay buffers.
- ReVal combines stepwise internal-consistency signals with trajectory-level outcome-verification signals to train its value estimates more effectively (a toy sketch of this recipe follows the list).
- Experiments on mathematical reasoning benchmarks show faster convergence and better final performance than GRPO, including gains on DeepSeek-R1-Distill-1.5B (up to +4.5% on GPQA and +2.7% on AIME24).
- The authors conclude that value-based RL can be a practical alternative to policy-based methods for LLM training, especially when trajectory generation is costly.
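To make the off-policy, value-based recipe above concrete, here is a minimal Python sketch of the general pattern the key points describe: stale trajectories stored in a replay buffer, a one-step Bellman target, and a reward that blends a stepwise consistency signal with a trajectory-level outcome signal. All names and parameters (`Transition`, `step_signal`, `beta`) are illustrative assumptions, not the paper's actual API or algorithm; a tabular value table stands in for a learned value head.

```python
import random
from collections import deque
from dataclasses import dataclass

# Hypothetical transition record; field names are illustrative,
# not taken from the ReVal paper.
@dataclass(frozen=True)
class Transition:
    state: tuple          # stand-in for prompt + partial response
    action: int           # stand-in for the chosen reasoning step
    step_signal: float    # stepwise internal-consistency score in [0, 1]
    outcome: float        # trajectory-level verification result (0 or 1)
    next_state: tuple
    done: bool

class ReplayBuffer:
    """Fixed-size FIFO buffer enabling off-policy reuse of trajectories."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def add(self, t: Transition):
        self.buf.append(t)

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

def bellman_target(t: Transition, value_fn, gamma=0.99, beta=0.5):
    """One-step Bellman target with a blended reward; beta weights the
    stepwise signal against the trajectory outcome (an assumed knob,
    not a parameter named in the paper)."""
    reward = beta * t.step_signal + (1 - beta) * t.outcome
    bootstrap = 0.0 if t.done else gamma * value_fn(t.next_state)
    return reward + bootstrap

def td_update(values: dict, batch, lr=0.1, gamma=0.99):
    """Move V(s) toward the Bellman target for each sampled transition
    (tabular stand-in for a gradient step on a value head)."""
    value_fn = lambda s: values.get(s, 0.0)
    for t in batch:
        target = bellman_target(t, value_fn, gamma)
        v = values.get(t.state, 0.0)
        values[t.state] = v + lr * (target - v)

# Usage: fill the buffer from (possibly stale) rollouts, then train on
# resampled batches -- the off-policy reuse the key points describe.
buffer = ReplayBuffer()
buffer.add(Transition(("q", "step1"), 0, 0.8, 1.0, ("q", "step2"), False))
buffer.add(Transition(("q", "step2"), 1, 0.6, 1.0, ("q",), True))
values = {}
for _ in range(100):
    td_update(values, buffer.sample(2))
print(values)
```

The point of the sketch is the contrast with on-policy methods such as GRPO: because targets are computed by bootstrapping from the value function rather than from fresh policy rollouts, each expensive trajectory can be replayed many times.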