Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
arXiv cs.LG / 4/14/2026
Key Points
- The paper revisits credit assignment in LLM reinforcement learning and argues that conventional discriminative (one-shot scalar) value critics are hard to train reliably because the one-shot prediction paradigm limits their expressiveness.
- It cites representation complexity theory and scaling experiments showing that these critics do not improve reliably with increased scale.
- To address this, the authors propose Generative Actor-Critic (GenAC), replacing one-shot value prediction with a generative critic that performs chain-of-thought reasoning before outputting a value estimate.
- They add In-Context Conditioning to keep the critic calibrated to the current actor during training, improving both value approximation quality and robustness.
- Experiments indicate that GenAC improves ranking reliability and out-of-distribution generalization, and yields stronger downstream RL performance than both value-based and value-free baselines.
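The core shift described above is that the critic generates free-form reasoning text and only then commits to a scalar value, which must be parsed out of the generation. The sketch below illustrates one way such an interface could look; the `Value:` output format, the function name, and the clipping range are illustrative assumptions, not details from the paper.

```python
import re

def parse_generative_value(critic_output: str,
                           lo: float = 0.0, hi: float = 1.0) -> float:
    """Extract the terminal scalar from a generative critic's output.

    Assumes (hypothetically) the critic ends its chain-of-thought with a
    line of the form 'Value: <number>'; the parsed value is clipped to
    [lo, hi] for use as a value estimate in the RL update.
    """
    match = re.search(r"Value:\s*(-?\d+(?:\.\d+)?)\s*$", critic_output.strip())
    if match is None:
        raise ValueError("critic output missing terminal 'Value:' line")
    return min(hi, max(lo, float(match.group(1))))

# Example: reasoning first, scalar last.
output = (
    "The partial response sets up the equation correctly but has not "
    "yet solved for x, so it is promising but incomplete.\n"
    "Value: 0.65"
)
print(parse_generative_value(output))  # 0.65
```

Compared with a one-shot scalar head, this interface lets the critic spend tokens reasoning about the state before committing to a number, which is the expressiveness argument the paper makes.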