Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
arXiv cs.LG / 4/28/2026
📰 News · Models & Research
Key Points
- The paper argues that current RLHF reward models, trained to score only the final token, discard informative signal from intermediate tokens and produce noisy token-level predictions.
- It proposes Temporally Coherent Reward Modeling (TCRM), which adds two regularization terms to the standard Bradley–Terry loss so that each token-level reward output becomes the conditional expectation of the final reward given the response so far.
- The regularizers map onto Monte Carlo and temporal-difference (TD) style objectives, tying the reward model's token-level outputs directly to RL value functions (see the first sketch after this list).
- Experiments report far more interpretable token-level reward trajectories, raising middle-token pairwise accuracy from 50% to 88.9% while keeping final-token accuracy intact, alongside strong ProcessBench results (44.9% average F1) despite training only on outcome-level data.
- TCRM also lets a single model serve as both reward model and value function in PPO, cutting peak GPU memory by 27% and step time by 19% without sacrificing LLM quality (see the second sketch after this list).
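The paper's exact objective is not reproduced in this summary; the following is a minimal PyTorch-style sketch of how a Bradley–Terry preference loss might be combined with Monte-Carlo- and TD-style regularizers over per-token reward outputs. The tensor layout, the `lambda_mc`/`lambda_td` weights, and the specific squared-error forms are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tcrm_loss(r_chosen, r_rejected, lambda_mc=0.1, lambda_td=0.1):
    """Hypothetical TCRM-style objective (illustrative only).

    r_chosen, r_rejected: per-token reward outputs for the chosen and
    rejected responses, shape (batch, seq_len). The final position is the
    usual outcome-level reward; intermediate positions are regularized so
    each token's output tracks the expected final reward.
    """
    # Standard Bradley-Terry preference loss on the final-token rewards.
    bt = -F.logsigmoid(r_chosen[:, -1] - r_rejected[:, -1]).mean()

    def mc_term(r):
        # Monte-Carlo-style target: pull every intermediate output toward
        # the (detached) final reward of the same response.
        target = r[:, -1:].detach()
        return ((r[:, :-1] - target) ** 2).mean()

    def td_term(r):
        # TD-style target: pull each output toward the (detached) output
        # at the next token, enforcing temporal coherence.
        return ((r[:, :-1] - r[:, 1:].detach()) ** 2).mean()

    reg_mc = mc_term(r_chosen) + mc_term(r_rejected)
    reg_td = td_term(r_chosen) + td_term(r_rejected)
    return bt + lambda_mc * reg_mc + lambda_td * reg_td
```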
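For the unified reward/value claim, here is a second minimal sketch, assuming the per-token outputs of a TCRM-trained model are reused directly as PPO value estimates so no separate critic is trained or kept in memory. The function name and model interface are assumptions for illustration.

```python
import torch

@torch.no_grad()
def rollout_scores(reward_model, input_ids, attention_mask):
    """Hypothetical unified scoring pass (illustrative only).

    One forward pass through the TCRM model yields both the scalar
    outcome reward (last token) and per-token value estimates
    (intermediate tokens), standing in for a separate PPO critic.
    """
    # Assumed interface: the model returns per-token scalar outputs,
    # shape (batch, seq_len).
    token_rewards = reward_model(input_ids, attention_mask=attention_mask)
    final_reward = token_rewards[:, -1]   # used as the RLHF reward
    values = token_rewards                # used as V(s_t) for GAE/PPO
    return final_reward, values
```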