Off-Policy Value-Based Reinforcement Learning for Large Language Models

arXiv cs.LG / 3/25/2026


Key Points

  • The paper argues that scaling RL for LLMs in long-horizon settings is limited by on-policy training, which wastes expensive trajectories and reduces sample efficiency.
  • It proposes ReVal, a value-based RL framework built on Bellman updates that enables off-policy learning with replay buffers.
  • ReVal combines stepwise internal-consistency signals with trajectory-level outcome-verification signals to train value estimates more effectively.
  • Experiments on mathematical reasoning benchmarks show faster convergence and better final performance than GRPO, including gains on DeepSeek-R1-Distill-1.5B (up to +4.5% on GPQA and +2.7% on AIME24).
  • The authors conclude that value-based RL can be a practical alternative to policy-based methods for LLM training, especially when trajectory generation is costly.
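To make the off-policy ingredient concrete, here is a minimal sketch of replay-buffer-based value learning with one-step Bellman (TD) backups. All names, the transition format, and the hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so they can be reused across updates,
    unlike on-policy methods that discard each batch after one use."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        # transition: (state, reward, next_state, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def bellman_update(values, transitions, gamma=0.99, lr=0.1):
    """One-step Bellman backup V(s) <- V(s) + lr * (target - V(s)),
    where target = r + gamma * V(s') (or just r at terminal states)."""
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * values.get(s_next, 0.0)
        values[s] = values.get(s, 0.0) + lr * (target - values.get(s, 0.0))
    return values
```

Because the Bellman target bootstraps from the current value estimate rather than from returns of the behavior policy, transitions generated by older policies remain valid training data, which is the source of the sample-efficiency gain the paper emphasizes.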

Abstract

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.