Controllable and Verifiable Process Data Synthesis for Process Reward Models

arXiv cs.AI / 5/5/2026

📰 News · Models & Research

Key Points

  • The paper introduces a controllable and verifiable method to synthesize process supervision data for process reward models (PRMs), addressing limits in existing data construction approaches.
  • It generates a correct symbolic reasoning chain, injects a template-aware error into a specific intermediate step, then recomputes the remaining steps under the corrupted state and verifies the injected error cannot be derived from its prefix.
  • The method produces paired trajectories that are invalid at the first error (prefix-invalid) while remaining consistent after recomputation, and converts them into aligned natural-language processes for PRM training and evaluation.
  • Experiments indicate that the synthesized data improve Best-of-8 reranking performance on logical reasoning benchmarks and transfer to mathematical reasoning, with step-level tests showing that first-error localization is harder than overall step classification.
  • The work emphasizes the need for fine-grained, verifiable process supervision and provides an evaluation lens focused on first-error localization difficulty.
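The injection-and-recomputation pipeline in the points above can be sketched on a toy symbolic chain. Everything below is a hypothetical illustration, not the paper's implementation: the chain is a sequence of simple arithmetic operations, the "template-aware error" is reduced to an off-by-delta corruption, and the verifier just checks that the corrupted step disagrees with what the correct prefix would produce.

```python
import random

def make_correct_chain(x0, n_steps=5, seed=0):
    """Toy symbolic chain: each step applies a known operation to the
    previous state, so every intermediate value is recomputable."""
    rng = random.Random(seed)
    ops = [rng.choice([("add", rng.randint(1, 9)), ("mul", rng.randint(2, 4))])
           for _ in range(n_steps)]
    states = [x0]
    for kind, arg in ops:
        states.append(states[-1] + arg if kind == "add" else states[-1] * arg)
    return ops, states

def inject_and_recompute(ops, states, err_idx, delta=1):
    """Corrupt the state at err_idx (a stand-in for a template-aware
    error), then recompute all later steps under the corrupted state so
    the suffix of the trajectory stays internally consistent."""
    bad = list(states[:err_idx + 1])
    bad[err_idx] += delta  # injected error at the chosen intermediate step
    x = bad[err_idx]
    for kind, arg in ops[err_idx:]:
        x = x + arg if kind == "add" else x * arg
        bad.append(x)
    return bad

def first_error_verified(states, bad, err_idx):
    """Verify prefix-invalidity: the injected step is not derivable from
    its prefix (it differs from the correct value), while every step
    before it matches the correct chain."""
    return bad[err_idx] != states[err_idx] and bad[:err_idx] == states[:err_idx]
```

The paired output (`states`, `bad`) mirrors the paired trajectories described above: identical up to the injected step, divergent but self-consistent afterwards.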

Abstract

Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
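The Best-of-8 reranking protocol mentioned in the abstract can be sketched generically. This is a minimal illustration under assumed conventions, not the paper's setup: `step_scorer` stands in for a trained PRM's per-step reward, and `min` aggregation (penalizing the weakest step) is one common choice for collapsing step scores into a candidate score.

```python
def best_of_n(candidates, step_scorer, aggregate=min):
    """Best-of-N reranking: score every step of each candidate solution
    with a PRM-style scorer, aggregate the step scores per candidate,
    and return the highest-scoring candidate."""
    def score(steps):
        return aggregate(step_scorer(s) for s in steps)
    return max(candidates, key=score)

# Toy stand-in scorer (illustrative only): longer steps score higher.
toy_scorer = len

cands = [
    ["x = 2", "x += 3", "ans 5"],
    ["x = 2", "oops", "ans 9"],
]
best = best_of_n(cands, toy_scorer)
```

With `min` aggregation, the second candidate is rejected because its weakest step ("oops") scores lowest; with N=8 candidates this is the Best-of-8 setting the experiments evaluate.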