
$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

arXiv cs.LG / 3/12/2026


Key Points

  • It proposes V_{0.5}, an adaptive baseline for RL with verifiable rewards that blends a pre-trained value model's prior with the empirical mean from sparse rollouts to reduce variance.
  • It introduces a real-time hypothesis test and dynamic budget allocation to judge the prior's reliability and allocate additional rollouts on demand.
  • The approach minimizes the baseline estimator's mean squared error, enabling stable policy gradients even under extreme data sparsity (group size of 4).
  • It reports faster convergence and about 10% performance improvement over GRPO and DAPO across six mathematical reasoning benchmarks.
  • It builds on Generalist Value Models (such as V_0) that encode model capabilities in-context, allowing value estimation without synchronizing updates with the policy model.
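The core idea in the first three points, fusing a biased-but-cheap prior with an unbiased-but-noisy empirical mean so as to minimize the baseline's MSE, can be sketched as a shrinkage estimator. This is a minimal illustration, not the paper's implementation: the function name `blended_baseline` and the argument `prior_bias_est` (an estimate of the prior's systematic bias) are assumptions, and the closed-form weight below treats the prior as noise-free.

```python
import numpy as np

def blended_baseline(prior, rewards, prior_bias_est):
    """Blend a value-model prior with the empirical mean of sparse rollouts.

    Sketch under simplifying assumptions: the empirical mean is unbiased
    with variance s^2/n, the prior has squared bias b^2 and no variance.
    The convex weight minimizing the blend's MSE is then
        w_prior = (s^2/n) / (s^2/n + b^2),
    so a noisy mean (small n) shifts weight toward the prior, while a
    badly biased prior shifts weight toward the empirical mean.
    """
    n = len(rewards)
    emp_mean = float(np.mean(rewards))
    # Variance of the sample mean; fall back to 1.0 for a single rollout.
    var_mean = float(np.var(rewards, ddof=1)) / n if n > 1 else 1.0
    w = var_mean / (var_mean + prior_bias_est ** 2)
    return w * prior + (1.0 - w) * emp_mean

# With a trusted prior (bias estimate 0), the blend returns the prior;
# with a clearly biased prior, it falls back to the empirical mean.
trusted = blended_baseline(0.9, [1, 0, 1, 0], prior_bias_est=0.0)
biased = blended_baseline(0.9, [1, 0, 1, 0], prior_bias_est=10.0)
```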

Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as V_0), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose V_{0.5}, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This yields a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce a real-time statistical test and a dynamic budget allocation mechanism. Together they balance the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that V_{0.5} significantly outperforms GRPO and DAPO, achieving faster convergence and roughly 10% performance improvement.
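The reliability test and budget-allocation step described in the abstract can be sketched as a simple z-test: under the null hypothesis that the prior equals the true mean reward, a large standardized gap between the prior and the empirical rollout mean flags the prior as unreliable and triggers extra rollouts. This is a hedged illustration, assuming a two-sided test at a fixed critical value; the function name `allocate_rollouts` and the specific budget numbers are placeholders, not the paper's actual procedure.

```python
import math

def allocate_rollouts(prior, rewards, base_budget=4, extra=4, z_crit=1.96):
    """Decide whether to request additional rollout budget.

    Null hypothesis: the value-model prior equals the true mean reward.
    If the z-statistic of (empirical mean - prior) exceeds z_crit, the
    prior is judged unreliable and `extra` rollouts are added to the
    base group size (4, matching the extreme-sparsity setting).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Sample variance of the rewards (Bessel-corrected where possible).
    var = sum((r - mean) ** 2 for r in rewards) / max(n - 1, 1)
    se = math.sqrt(var / n) if var > 0 else 1e-8  # guard degenerate groups
    z = abs(mean - prior) / se
    return base_budget + (extra if z > z_crit else 0)

# A prior that matches the rollout mean keeps the sparse budget of 4;
# a prior far from unanimous rollout outcomes triggers 4 extra rollouts.
keep = allocate_rollouts(0.5, [1, 0, 1, 0])
grow = allocate_rollouts(0.0, [1, 1, 1, 1])
```

The design choice here is the same trade-off the abstract names: extra rollouts are spent only where the prior fails the test, so the average cost stays close to the sparse group size of 4.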