Efficient Process Reward Modeling via Contrastive Mutual Information

arXiv cs.CL / April 14, 2026


Key Points

  • The paper addresses the high cost of training process reward models (PRMs) for chain-of-thought by avoiding step-level human reward annotations and expensive Monte Carlo (MC) rollouts.
  • It introduces contrastive pointwise mutual information (CPMI) as an automatic reward-labeling method that uses the model’s internal probabilities to estimate a step’s contribution to the correct final answer versus hard-negative alternatives.
  • CPMI computes how much a reasoning step increases mutual information between that step and the target answer, treating this contrastive signal as a reliable proxy reward for step-level supervision.
  • Experiments report major efficiency gains, cutting dataset construction time by 84% and token generation by 98% relative to MC estimation, while improving accuracy on both process-level evaluations and mathematical reasoning benchmarks.

Abstract

Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward-labeling method that leverages the model's internal probabilities to infer step-level supervision while significantly reducing the computational burden of dataset annotation. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step's contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
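The scoring rule described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the paper's implementation: the `log_p_answer` scorer is a stub standing in for the model's internal log-probability of the answer given a prompt, and the choice to pool hard-negative PMIs with a log-sum-exp average is an assumption.

```python
import math

# Hypothetical scorer: in practice this would query the LLM for
# log p(answer | prompt). Stubbed here with fixed, made-up values.
def log_p_answer(prompt: str, answer: str) -> float:
    scores = {
        ("Q", "42"): -3.0,               # answer likelihood from the question alone
        ("Q + good step", "42"): -1.0,   # a correct step raises the answer's likelihood
        ("Q + bad step 1", "42"): -3.5,  # hard negatives lower or barely move it
        ("Q + bad step 2", "42"): -4.0,
    }
    return scores[(prompt, answer)]

def pmi(context: str, step: str, answer: str) -> float:
    """Pointwise mutual information gain of `step` toward `answer`:
    log p(answer | context, step) - log p(answer | context)."""
    return log_p_answer(f"{context} + {step}", answer) - log_p_answer(context, answer)

def cpmi_reward(context: str, step: str, negatives: list, answer: str) -> float:
    """Contrast the candidate step's PMI against the pooled PMI of
    hard-negative alternative steps (log-sum-exp average, an assumption)."""
    neg_pmis = [pmi(context, n, answer) for n in negatives]
    pooled = math.log(sum(math.exp(x) for x in neg_pmis) / len(neg_pmis))
    return pmi(context, step, answer) - pooled

reward = cpmi_reward("Q", "good step", ["bad step 1", "bad step 2"], "42")
print(round(reward, 3))  # → 2.719: positive reward, the step helps reach the answer
```

Because the label comes from two forward passes per candidate step rather than many sampled rollouts to a final answer, this kind of scoring avoids the repeated-generation cost that makes MC estimation expensive.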