$V_0$: A Generalist Value Model for Any Policy at State Zero

arXiv cs.CL / 4/1/2026



Key Points

The paper explains that standard actor-critic LLM training (e.g., PPO) uses a value/critic baseline that must track a continuously evolving policy, often requiring costly synchronous updates.
It reviews how GRPO removes the coupled value model by using group-average rewards as a baseline, but this shifts the burden to heavy sampling to keep estimates stable.
The authors propose $V_0$, a generalist value model that estimates expected performance on unseen prompts without parameter updates by treating the model’s changing capability as explicit context.
$V_0$ is framed as operating at “State Zero” (the initial prompt), using instruction–performance history to predict success rates ahead of rollout for more efficient training-time sampling.
At deployment, the same predictions route each instruction to the most cost-effective suitable model. Experiments show that $V_0$ outperforms heuristic budget allocation during training and achieves a strong performance-cost Pareto trade-off in LLM routing.
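To make the GRPO baseline mentioned in the key points concrete, here is a minimal sketch of a group-relative advantage: each rollout's reward is compared with the mean reward of its own group. The function name `group_relative_advantages`, the `eps` parameter, and the division by the group standard deviation are illustrative assumptions (the normalization is a common GRPO choice that the summary above does not spell out), not code from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against the mean reward of its own group.

    Assumption: advantages are also divided by the group's standard
    deviation; `eps` avoids division by zero when all rewards match.
    """
    baseline = mean(rewards)          # group-average reward acts as the baseline
    scale = pstdev(rewards) + eps     # population std of the group
    return [(r - baseline) / scale for r in rewards]


# Four rollouts for one prompt with binary (pass/fail) rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every rollout succeeds carries no learning signal:
# every advantage is zero, so those samples are effectively wasted.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The second call hints at why the baseline depends on sampling: prompts that the policy always solves (or always fails) yield zero advantages, so the usefulness of a fixed rollout budget varies a lot across prompts.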

Abstract

Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.
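The two scheduling roles described in the abstract can be illustrated with a small sketch that takes predicted success rates as given. Everything here is hypothetical: the names `allocate_rollouts` and `route`, the rule of weighting prompts by `p * (1 - p)` (the variance of a pass/fail reward), the `min_per_prompt` floor, and the `min_success` threshold are assumptions chosen for illustration, not the allocation or routing rules used in the paper.

```python
def allocate_rollouts(pred_success, total_budget, min_per_prompt=2):
    """Split a rollout budget across prompts using predicted success rates.

    Assumption: prompts are weighted by p * (1 - p), so prompts predicted
    to be near-certain passes or failures receive fewer rollouts.
    """
    n = len(pred_success)
    spare = total_budget - n * min_per_prompt
    if spare < 0:
        raise ValueError("budget too small for the per-prompt minimum")
    weights = [p * (1.0 - p) for p in pred_success]
    total_w = sum(weights)
    if total_w == 0.0:                      # all predictions are 0 or 1
        weights, total_w = [1.0] * n, float(n)
    shares = [spare * w / total_w for w in weights]
    extra = [int(s) for s in shares]        # floor of each fractional share
    leftover = spare - sum(extra)
    # Give the remaining rollouts to the largest fractional parts.
    order = sorted(range(n), key=lambda i: shares[i] - extra[i], reverse=True)
    for i in order[:leftover]:
        extra[i] += 1
    return [min_per_prompt + e for e in extra]


def route(pred_by_model, cost_by_model, min_success=0.8):
    """Pick the cheapest model predicted to clear `min_success`.

    Falls back to the model with the highest predicted success rate
    when no model clears the threshold.
    """
    ok = [m for m, p in pred_by_model.items() if p >= min_success]
    if ok:
        return min(ok, key=lambda m: cost_by_model[m])
    return max(pred_by_model, key=pred_by_model.get)


# Training time: three prompts, 24 rollouts in total.
budget = allocate_rollouts([0.05, 0.5, 0.95], total_budget=24)

# Deployment time: hypothetical model names and relative costs.
costs = {"small": 1.0, "large": 10.0}
easy = route({"small": 0.85, "large": 0.97}, costs)
hard = route({"small": 0.40, "large": 0.70}, costs)
```

In this sketch the prompt with a predicted success rate of 0.5 receives most of the budget, the easy instruction goes to the cheaper model, and the hard instruction falls back to the model with the best predicted chance of success.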


