Process Reward Agents for Steering Knowledge-Intensive Reasoning
arXiv cs.AI / 2026/4/13
Key points
- The paper introduces Process Reward Agents (PRA), a test-time method that supplies domain-grounded, online, step-wise rewards to a frozen reasoning policy when intermediate steps are not locally verifiable.
- Unlike prior process reward models that score completed trajectories post hoc, PRA uses search-based decoding to rank and prune candidate reasoning trajectories at every generation step, so the reward signal guides decoding as it unfolds rather than only after generation.
- Experiments on multiple medical reasoning benchmarks show PRA improves accuracy, reaching 80.8% on MedQA with Qwen3-4B, reported as state of the art at the 4B parameter scale.
- PRA generalizes across unseen frozen policy backbones from 0.5B to 8B parameters, improving accuracy by up to 25.7% without updating any model weights.
- The authors argue PRA supports a broader paradigm in which frozen reasoners are decoupled from domain-specific reward modules, making it easier to deploy new backbones in knowledge-intensive domains without retraining.
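The ranking-and-pruning loop described above can be sketched as a reward-guided beam search. This is a minimal illustration, not the paper's implementation: the function names (`policy_step`, `reward_agent`), the beam parameters, and the toy policy/reward below are all assumptions introduced for clarity.

```python
import heapq

def stepwise_reward_search(policy_step, reward_agent,
                           beam_width=4, expansions=3, max_steps=8):
    """Reward-guided beam search over reasoning trajectories.

    At each step, every surviving partial trajectory is expanded into
    several candidate next steps by a frozen policy; an online process
    reward scores each new step, and only the top-scoring trajectories
    are kept (rank-and-prune at every generation step).
    """
    beams = [([], 0.0)]  # (trajectory, cumulative step-wise reward)
    for _ in range(max_steps):
        candidates = []
        for traj, score in beams:
            for k in range(expansions):
                step = policy_step(traj, k)   # frozen policy proposes k-th next step
                if step is None:              # trajectory is finished; carry it over
                    candidates.append((traj, score))
                    break
                r = reward_agent(traj, step)  # domain-grounded online step reward
                candidates.append((traj + [step], score + r))
        # prune: keep only the beam_width best partial trajectories
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return max(beams, key=lambda b: b[1])[0]

# Toy stand-ins (purely hypothetical): the policy emits three variants per
# step and stops after three steps; the reward prefers higher variant indices.
def toy_policy(traj, k):
    return None if len(traj) >= 3 else f"s{len(traj)}.{k}"

def toy_reward(traj, step):
    return float(step.split(".")[1])

best = stepwise_reward_search(toy_policy, toy_reward,
                              beam_width=2, expansions=3, max_steps=5)
print(best)  # → ['s0.2', 's1.2', 's2.2']
```

The key contrast with post-hoc process reward models is that pruning happens inside the decoding loop, so low-reward intermediate steps are discarded before they are ever extended.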



