Controllable and Verifiable Process Data Synthesis for Process Reward Models

arXiv cs.AI / 5/5/2026

📰 News · Models & Research

Key Points

  • The paper introduces a controllable and verifiable method to synthesize process supervision data for process reward models (PRMs), addressing limits in existing data construction approaches.
  • It generates a correct symbolic reasoning chain, injects a template-aware error into a specific intermediate step, then recomputes the remaining steps under the corrupted state and verifies the injected error cannot be derived from its prefix.
  • The method produces paired trajectories that are invalid at the first error (prefix-invalid) while remaining consistent after recomputation, and converts them into aligned natural-language processes for PRM training and evaluation.
  • Experiments indicate that the synthesized data improve Best-of-8 reranking performance on logical reasoning benchmarks and transfer to mathematical reasoning, with step-level tests showing that first-error localization is harder than overall step classification.
  • The work emphasizes the need for fine-grained, verifiable process supervision and provides an evaluation lens focused on first-error localization difficulty.
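The injection-and-recomputation pipeline in the points above can be sketched on a toy symbolic chain. Everything below is a hypothetical illustration, not the paper's implementation: the chain is a sequence of simple arithmetic operations, the "template-aware error" is reduced to an off-by-delta corruption, and the verifier just checks that the corrupted step disagrees with what the correct prefix would produce.

```python
import random

def make_correct_chain(x0, n_steps=5, seed=0):
    """Toy symbolic chain: each step applies a known operation to the
    previous state, so every intermediate value is recomputable."""
    rng = random.Random(seed)
    ops = [rng.choice([("add", rng.randint(1, 9)), ("mul", rng.randint(2, 4))])
           for _ in range(n_steps)]
    states = [x0]
    for kind, arg in ops:
        states.append(states[-1] + arg if kind == "add" else states[-1] * arg)
    return ops, states

def inject_and_recompute(ops, states, err_idx, delta=1):
    """Corrupt the state at err_idx (a stand-in for a template-aware
    error), then recompute all later steps under the corrupted state so
    the suffix of the trajectory stays internally consistent."""
    bad = list(states[:err_idx + 1])
    bad[err_idx] += delta  # injected error at the chosen intermediate step
    x = bad[err_idx]
    for kind, arg in ops[err_idx:]:
        x = x + arg if kind == "add" else x * arg
        bad.append(x)
    return bad

def first_error_verified(states, bad, err_idx):
    """Verify prefix-invalidity: the injected step is not derivable from
    its prefix (it differs from the correct value), while every step
    before it matches the correct chain."""
    return bad[err_idx] != states[err_idx] and bad[:err_idx] == states[:err_idx]
```

The paired output (`states`, `bad`) mirrors the paired trajectories described above: identical up to the injected step, divergent but self-consistent afterwards.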

Abstract

Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
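The Best-of-8 reranking protocol mentioned in the abstract can be sketched generically. This is a minimal illustration under assumed conventions, not the paper's setup: `step_scorer` stands in for a trained PRM's per-step reward, and `min` aggregation (penalizing the weakest step) is one common choice for collapsing step scores into a candidate score.

```python
def best_of_n(candidates, step_scorer, aggregate=min):
    """Best-of-N reranking: score every step of each candidate solution
    with a PRM-style scorer, aggregate the step scores per candidate,
    and return the highest-scoring candidate."""
    def score(steps):
        return aggregate(step_scorer(s) for s in steps)
    return max(candidates, key=score)

# Toy stand-in scorer (illustrative only): longer steps score higher.
toy_scorer = len

cands = [
    ["x = 2", "x += 3", "ans 5"],
    ["x = 2", "oops", "ans 9"],
]
best = best_of_n(cands, toy_scorer)
```

With `min` aggregation, the second candidate is rejected because its weakest step ("oops") scores lowest; with N=8 candidates this is the Best-of-8 setting the experiments evaluate.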