A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

arXiv cs.CL / 4/30/2026

Key Points

  • The paper surveys Process Reward Models (PRMs), which reward and guide LLM reasoning at the step or trajectory level, as an alternative to outcome reward models (ORMs) that judge only the final answer (see the sketch after this list).
  • It lays out an end-to-end “full loop” perspective, covering how to generate process data, construct PRMs, and apply them for test-time scaling and reinforcement learning.
  • The survey compiles applications of PRMs across multiple domains, including math, code, text, multimodal reasoning, robotics, and agent-based systems.
  • It reviews emerging benchmarks and aims to clarify design trade-offs and highlight open challenges for achieving fine-grained, robust reasoning alignment.
  • Overall, the work is positioned as a research roadmap to advance alignment beyond final-answer supervision toward reasoning supervision.
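
To make the ORM/PRM contrast in the first key point concrete, here is a minimal Python sketch. The functions `orm_score` and `prm_score_step`, and their toy scoring heuristics, are hypothetical stand-ins for learned reward models, not anything defined in the paper.

```python
# Hypothetical illustration of ORM vs. PRM scoring; both model calls are stubs.
from typing import List

def orm_score(question: str, final_answer: str) -> float:
    """Outcome reward model stand-in: one scalar for the final answer only."""
    return 1.0 if final_answer.strip() == "42" else 0.0  # placeholder judgment

def prm_score_step(question: str, steps_so_far: List[str], step: str) -> float:
    """Process reward model stand-in: one scalar per reasoning step."""
    return 0.7 if "therefore" in step.lower() else 0.9  # placeholder judgment

question = "What is 6 * 7?"
steps = [
    "Rewrite 6 * 7 as 6 * (5 + 2).",
    "6 * 5 + 6 * 2 = 30 + 12 = 42.",
    "Therefore the answer is 42.",
]
final_answer = "42"

# ORM view: a single reward attached to the outcome of the whole trajectory.
outcome_reward = orm_score(question, final_answer)

# PRM view: a reward for every step, enabling step-level credit assignment.
step_rewards = [
    prm_score_step(question, steps[:i], step) for i, step in enumerate(steps)
]

print("ORM reward:", outcome_reward)
print("PRM step rewards:", step_rewards)
```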

Abstract

Although Large Language Models (LLMs) exhibit advanced reasoning abilities, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
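
As one concrete instance of the test-time scaling use case named in the abstract, the sketch below implements PRM-guided best-of-N selection: sample several reasoning traces, score each step with a PRM, aggregate the step scores, and keep the best trace. The functions `generate_candidates` and `prm_score_step` are assumed placeholders, and aggregating by the minimum step score is only one common choice, not necessarily the one any specific PRM work uses.

```python
# Hypothetical best-of-N test-time scaling guided by a PRM; all model calls are stubs.
from typing import List, Tuple
import random

def generate_candidates(question: str, n: int) -> List[List[str]]:
    """Stand-in for sampling n reasoning traces (lists of steps) from an LLM."""
    return [[f"sample {i}, step {j}" for j in range(3)] for i in range(n)]

def prm_score_step(question: str, prefix: List[str], step: str) -> float:
    """Stand-in for a PRM that scores one step given the steps before it."""
    return random.random()  # placeholder score in [0, 1]

def trace_score(question: str, trace: List[str]) -> float:
    """Aggregate per-step scores; the minimum penalizes any single weak step."""
    return min(prm_score_step(question, trace[:i], s) for i, s in enumerate(trace))

def best_of_n(question: str, n: int = 8) -> Tuple[List[str], float]:
    """Sample n traces, score each with the PRM, and keep the highest-scoring one."""
    scored = [(t, trace_score(question, t)) for t in generate_candidates(question, n)]
    return max(scored, key=lambda pair: pair[1])

best_trace, best_score = best_of_n("What is 6 * 7?")
print("Selected trace:", best_trace)
print("Aggregate PRM score:", round(best_score, 3))
```

The same per-step scores can also serve as dense rewards for reinforcement learning, which is the other application direction the survey covers.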