Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling

arXiv cs.LG · April 28, 2026

📰 News · Models & Research

Key Points

  • The paper argues that current RLHF reward models, trained to score only the final token, waste informative signals from intermediate tokens and lead to noisy token-level predictions.
  • It proposes Temporally Coherent Reward Modeling (TCRM), which adds two regularization terms to the standard Bradley–Terry loss so that each token-level reward output becomes the conditional expectation of the final reward given the response so far (a loss sketch follows this list).
  • The regularizers correspond to Monte Carlo and temporal-difference (TD) value-learning objectives, directly linking reward-model outputs to RL value functions.
  • Experiments show markedly more interpretable token-level reward trajectories (middle-token pairwise accuracy rises from 50% to 88.9%) with final-token accuracy preserved, along with strong ProcessBench performance (44.9% average F1) despite training only on outcome-level data.
  • TCRM also enables a unified reward/value model in PPO, cutting peak GPU memory by 27% and step time by 19% without sacrificing LLM quality (see the reuse sketch after this list).
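
The bullets above only summarize the loss; the exact regularizers are not spelled out in this write-up. As a rough illustration, the PyTorch-style sketch below shows one way a Bradley–Terry loss could be combined with Monte Carlo-style and TD-style token-level regularizers in the spirit described. The function name, coefficients, masking scheme, and stop-gradient placement are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: Bradley-Terry loss plus MC-style and TD-style token-level
# regularizers. Names, weights, and detach choices are illustrative assumptions.
import torch
import torch.nn.functional as F

def tcrm_loss(chosen_scores, rejected_scores, chosen_mask, rejected_mask,
              lambda_mc=0.1, lambda_td=0.1):
    """chosen_scores, rejected_scores: (batch, seq_len) per-token reward outputs.
    chosen_mask, rejected_mask: (batch, seq_len), 1.0 on response tokens, 0.0 on padding."""

    def final_score(scores, mask):
        # Score at the last real response token of each sequence.
        last = mask.long().sum(dim=1) - 1
        return scores.gather(1, last.unsqueeze(1)).squeeze(1)

    # Standard Bradley-Terry preference loss on the final-token scores only.
    bt = -F.logsigmoid(final_score(chosen_scores, chosen_mask)
                       - final_score(rejected_scores, rejected_mask)).mean()

    def mc_reg(scores, mask):
        # Monte Carlo-style term: pull every token's score toward the
        # (detached) final score, i.e. the eventual outcome of the response.
        target = final_score(scores, mask).detach().unsqueeze(1)
        return (((scores - target) ** 2) * mask).sum() / mask.sum().clamp(min=1.0)

    def td_reg(scores, mask):
        # TD-style term: pull each token's score toward the (detached)
        # score at the following token.
        diff = (scores[:, :-1] - scores[:, 1:].detach()) ** 2
        pair_mask = mask[:, :-1] * mask[:, 1:]
        return (diff * pair_mask).sum() / pair_mask.sum().clamp(min=1.0)

    reg = 0.0
    for scores, mask in ((chosen_scores, chosen_mask), (rejected_scores, rejected_mask)):
        reg = reg + lambda_mc * mc_reg(scores, mask) + lambda_td * td_reg(scores, mask)
    return bt + reg
```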
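
The efficiency numbers in the last bullet come from dropping the separate critic: if the reward model's per-token outputs already approximate a value function, PPO can read both the terminal reward and the per-token values from that single network. Below is a hypothetical sketch of that reuse via standard GAE; the function name, masking scheme, and hyperparameters are illustrative assumptions, not the paper's code, and in practice these quantities would be computed without gradients during rollouts.

```python
# Hypothetical sketch of one network serving as both reward and value source in PPO:
# per-token outputs act as V(s_t), the final-token output as the terminal reward.
import torch

def advantages_from_reward_model(token_scores, mask, gamma=1.0, lam=0.95):
    """token_scores: (batch, seq_len) per-token outputs of a TCRM-style reward model.
    mask: (batch, seq_len), 1.0 on response tokens, 0.0 on padding."""
    batch, seq_len = token_scores.shape
    values = token_scores * mask

    # Terminal reward = the model's score at the last response token.
    last = mask.long().sum(dim=1) - 1
    terminal_reward = token_scores.gather(1, last.unsqueeze(1))

    # Per-step rewards are zero except at the final response token.
    rewards = torch.zeros_like(token_scores)
    rewards.scatter_(1, last.unsqueeze(1), terminal_reward)

    # Standard GAE backward pass; the single network supplies both the rewards
    # above and the per-token value estimates.
    advantages = torch.zeros_like(token_scores)
    gae = token_scores.new_zeros(batch)
    for t in reversed(range(seq_len)):
        next_value = values[:, t + 1] if t + 1 < seq_len else token_scores.new_zeros(batch)
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        gae = delta + gamma * lam * gae
        advantages[:, t] = gae
    return advantages * mask
```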

Abstract

Reward models in RLHF are trained to score only the final token of a response - a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise. We argue this is a missed opportunity: a well-trained reward model's output at any token should represent the conditional expectation of the final reward given the response so far. We introduce Temporally Coherent Reward Modeling (TCRM), which induces this property via two regularization terms on top of the standard Bradley-Terry loss, with minimizers provably equal to conditional expectations. The regularizers correspond to Monte Carlo and TD value-learning objectives, establishing a direct connection to RL value functions. TCRM requires zero changes to architecture, data, or inference, yet unlocks three capabilities from one principle: interpretable token-level reward trajectories (middle-token pairwise accuracy improved from 50% to 88.9%, final-token accuracy preserved); state-of-the-art PRM performance on ProcessBench (44.9% average F1) among models trained only on outcome data; and unified reward/value modeling in PPO, reducing peak GPU memory by 27% and step time by 19% with matching LLM quality.
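
Spelled out in notation, the property the abstract describes is that the token-level output acts as a value estimate of the eventual reward. The display below is only a schematic consistent with this summary; the paper's exact regularizer forms, weights, and stop-gradient placement are not given here, and the specific terms shown are assumptions.

```latex
% Schematic only: the regularizer forms and the weights \lambda_MC, \lambda_TD
% are assumptions, not the paper's definitions.
\[
  r_\theta(x,\, y_{\le t}) \;\approx\; \mathbb{E}\big[\, r_\theta(x,\, y) \mid x,\, y_{\le t} \,\big]
  \qquad \text{(token-level output as a value estimate)}
\]
\[
  \mathcal{L}(\theta) \;=\;
  \underbrace{-\log \sigma\big(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\big)}_{\text{Bradley--Terry}}
  \;+\; \lambda_{\mathrm{MC}} \sum_{t}\big(r_\theta(x, y_{\le t}) - r_\theta(x, y)\big)^{2}
  \;+\; \lambda_{\mathrm{TD}} \sum_{t}\big(r_\theta(x, y_{\le t}) - r_\theta(x, y_{\le t+1})\big)^{2}
\]
```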