Efficient Process Reward Modeling via Contrastive Mutual Information
arXiv cs.CL / April 14, 2026
Key Points
- The paper addresses the high cost of training process reward models (PRMs) for chain-of-thought by avoiding step-level human reward annotations and expensive Monte Carlo (MC) rollouts.
- It introduces contrastive pointwise mutual information (CPMI) as an automatic reward-labeling method that uses the model’s internal probabilities to estimate a step’s contribution to the correct final answer versus hard-negative alternatives.
- CPMI scores each reasoning step by how much it raises the model's probability of the target answer (the step's pointwise mutual information with the answer), and treats this contrastive signal as a reliable proxy reward for step-level supervision.
- Experiments report major efficiency gains, cutting dataset construction time by 84% and token generation by 98% relative to MC estimation, while improving accuracy on process-level and mathematical reasoning benchmarks.
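The contrastive PMI idea described above can be sketched numerically. The snippet below is a toy illustration, not the paper's implementation: the function name, signature, and the log-softmax contrast are assumptions chosen to show how a step's PMI with the answer can be compared against hard-negative alternatives using only model log-probabilities.

```python
import math

def cpmi_step_reward(logp_answer_given_step, logp_answer_baseline,
                     logp_answer_given_negatives):
    """Toy CPMI-style step reward (hypothetical signature).

    logp_answer_given_step: log p(answer | context, candidate step)
    logp_answer_baseline:   log p(answer | context) without the step
    logp_answer_given_negatives: log p(answer | context, negative step)
                                 for each hard-negative alternative
    """
    # PMI: how much the candidate step raises the model's
    # log-probability of the correct final answer.
    pmi = logp_answer_given_step - logp_answer_baseline
    neg_pmis = [lp - logp_answer_baseline
                for lp in logp_answer_given_negatives]
    # Contrast via log-softmax over the candidate and its negatives:
    # higher (closer to 0) means the step is more informative than
    # the hard-negative alternatives.
    log_norm = math.log(sum(math.exp(x) for x in [pmi] + neg_pmis))
    return pmi - log_norm
```

Under this sketch, a step that raises the answer probability more than any hard negative receives the highest (least negative) reward, so the signal can rank steps without human annotation or MC rollouts.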

