$V_0$: A Generalist Value Model for Any Policy at State Zero

arXiv cs.CL / 4/1/2026


Key Points

  • The paper explains that standard actor-critic LLM training (e.g., PPO) uses a value/critic baseline that must track a continuously evolving policy, often requiring costly synchronous updates.
  • It reviews how GRPO removes the coupled value model by using group-average rewards as a baseline, but this shifts the burden to heavy sampling to keep estimates stable.
  • The authors propose $V_0$, a generalist value model that estimates expected performance on unseen prompts without parameter updates by treating the model’s changing capability as explicit context.
  • $V_0$ is framed as operating at “State Zero” (the initial prompt), using instruction–performance history to predict success rates ahead of rollout for more efficient training-time sampling.
  • At deployment, the same predictions route each instruction to the most cost-effective suitable model. Experiments show that $V_0$ outperforms heuristic budget allocation during training and reaches a Pareto-optimal performance–cost trade-off in LLM routing.
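As a rough illustration of the group-average baseline mentioned in the key points, the sketch below subtracts a group's mean reward from each rollout's reward. The function name and the 0/1 reward values are assumptions made for this example. The published GRPO formulation also normalizes by the group's standard deviation, which is omitted here for brevity.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Subtract the group's mean reward (the baseline) from each rollout's reward."""
    if not rewards:
        raise ValueError("need at least one rollout")
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four rollouts for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every rollout fails: all advantages are zero,
# so these rollouts contribute no gradient signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second call hints at why sampling cost matters: rollouts spent on prompts the policy always solves or always fails produce a flat group and are effectively wasted.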

Abstract

Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.
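The two scheduling roles described in the abstract can be sketched as follows. This is a minimal illustration and not the paper's method: the predicted success rates are plain numbers standing in for the value model's outputs, and both rules are assumptions chosen for simplicity (a rollout budget weighted by p(1 - p), and a threshold-based router). The function names, model names, costs, and the 0.7 threshold are hypothetical.

```python
def allocate_rollouts(predicted_success, total_budget, min_per_prompt=1):
    """Split a rollout budget across prompts in proportion to p * (1 - p).

    Assumption: prompts with a predicted success rate near 0 or 1 give
    little group-relative signal, so they receive fewer rollouts.
    """
    n = len(predicted_success)
    if total_budget < n * min_per_prompt:
        raise ValueError("budget too small for the per-prompt minimum")
    weights = [p * (1.0 - p) for p in predicted_success]
    total_weight = sum(weights)
    if total_weight == 0:
        # Every prompt is predicted trivially easy or impossible: split evenly.
        weights = [1.0] * n
        total_weight = float(n)
    spare = total_budget - n * min_per_prompt
    shares = [spare * w / total_weight for w in weights]
    counts = [min_per_prompt + int(s) for s in shares]
    # Hand out rollouts lost to rounding down, largest fractional part first.
    leftover = total_budget - sum(counts)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def route(predicted_success, cost, min_success=0.7):
    """Pick the cheapest model predicted to reach min_success.

    If no model reaches the threshold, fall back to the model with the
    highest predicted success rate.
    """
    eligible = [m for m, p in predicted_success.items() if p >= min_success]
    if eligible:
        return min(eligible, key=lambda m: cost[m])
    return max(predicted_success, key=predicted_success.get)


# Training side: three prompts with hypothetical predicted success rates.
counts = allocate_rollouts([0.5, 0.9, 0.0], total_budget=12)

# Deployment side: hypothetical models with per-request costs.
costs = {"small": 1.0, "large": 10.0}
easy_choice = route({"small": 0.82, "large": 0.95}, costs)
hard_choice = route({"small": 0.30, "large": 0.60}, costs)
```

In this sketch, the prompt predicted at 0.5 receives the largest share of the budget, the easy instruction goes to the cheaper model, and the hard one falls back to the model most likely to succeed. A learned or cost-weighted rule could replace either heuristic without changing the interface.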
