ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
arXiv cs.CL / 4/10/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces ConsistRM, a self-training framework for generative reward models (GRMs) that aims to align LLMs with human preferences without requiring costly human-annotated reward data.
- It proposes a Consistency-Aware Answer Reward that generates reliable, temporally consistent pseudo-labels, improving the stability of GRM training and optimization (see the first sketch after this list).
- It also adds a Consistency-Aware Critique Reward that evaluates semantic consistency across multiple critiques and assigns fine-grained, differentiated rewards, addressing the coarse, undifferentiated reward signals of prior self-training methods (see the second sketch after this list).
- Experiments across five benchmark datasets and four base models show ConsistRM outperforms vanilla reinforcement fine-tuning (RFT) by an average of 1.5%, while further analysis shows more consistent outputs and reduced position bias tied to input order.
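
The paper's exact formulation isn't given in this summary, but here is a minimal sketch of how a consistency-aware answer reward could work, assuming majority voting over sampled GRM verdicts plus a small bonus for agreeing with the previous round's pseudo-label. The function name, the voting scheme, and the `temporal_bonus` parameter are all illustrative assumptions, not the paper's method:

```python
# Hypothetical sketch only: majority-vote pseudo-labeling with an
# agreement-rate reward and a temporal-consistency bonus. None of these
# names or design choices come from the ConsistRM paper.
from collections import Counter
from typing import Sequence

def answer_consistency_reward(
    sampled_verdicts: Sequence[str],
    previous_label: str | None = None,
    temporal_bonus: float = 0.1,
) -> tuple[str, float]:
    """Pick the majority verdict as the pseudo-label and reward it by the
    agreement rate; add a small bonus if it matches last round's label."""
    counts = Counter(sampled_verdicts)
    label, votes = counts.most_common(1)[0]
    reward = votes / len(sampled_verdicts)  # agreement rate in [0, 1]
    if previous_label is not None and label == previous_label:
        reward = min(1.0, reward + temporal_bonus)
    return label, reward

# Example: two of three sampled verdicts agree, and the majority label
# matches the previous round, so the reward is 2/3 + 0.1.
print(answer_consistency_reward(["A", "A", "B"], previous_label="A"))
```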
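
Similarly, one plausible reading of the critique reward is to score each critique by its mean semantic similarity to the other sampled critiques, so mutually consistent critiques earn higher, differentiated rewards than outliers. The `similarity` callback (e.g., cosine similarity over sentence embeddings) and everything else below are assumptions for illustration:

```python
# Hypothetical sketch only: fine-grained critique rewards from pairwise
# semantic similarity. The similarity function is injected so any embedding
# model can back it; nothing here is from the paper's implementation.
from itertools import combinations
from typing import Callable, Sequence

def critique_consistency_rewards(
    critiques: Sequence[str],
    similarity: Callable[[str, str], float],
) -> list[float]:
    """Reward each critique by its average similarity to all other
    critiques, giving consistent critiques more credit than outliers."""
    n = len(critiques)
    if n < 2:
        return [1.0] * n  # a lone critique has nothing to disagree with
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = similarity(critiques[i], critiques[j])
        totals[i] += s
        totals[j] += s
    return [t / (n - 1) for t in totals]

# Toy example with a trivial word-overlap (Jaccard) similarity, just to
# keep the sketch runnable without an embedding model.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

print(critique_consistency_rewards(
    ["answer is correct", "answer is correct and clear", "totally wrong"],
    similarity=jaccard,
))
```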