GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

arXiv cs.AI / 5/5/2026


Key Points

  • The paper introduces GR-Ben, a new process-level benchmark that evaluates how well process reward models (PRMs) detect intermediate reasoning errors across realistic reasoning tasks.
  • Existing benchmarks are criticized for focusing mostly on mathematical reasoning, which leaves PRM error-detection performance largely untested in broader science and logic settings.
  • GR-Ben covers two main domains (science and logic) split into nine subdomains, enabling more comprehensive assessment than prior work.
  • Experiments across 22 models (including both PRMs and LLMs) show that error detection is generally weaker outside math, that PRMs struggle more with knowledge-based errors, and that LLMs perform worse at catching computational errors.
  • The authors suggest GR-Ben will help drive future PRM research for general domains and ultimately improve LLM reasoning quality.

Abstract

Process reward models (PRMs) have recently exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs must be able to detect process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning and therefore fail to comprehensively evaluate the error-detection ability of PRMs across diverse reasoning scenarios. To bridge this gap, we introduce GR-Ben, a process-level benchmark specifically designed to assess PRM performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) in domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker; (2) in general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs exhibit poorer performance in detecting computational errors. We hope GR-Ben can foster future research on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.
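
For readers unfamiliar with process-level evaluation, the sketch below illustrates one common scoring scheme for this kind of benchmark: the evaluated model predicts the index of the first flawed reasoning step, and accuracy is balanced across erroneous and error-free samples (an F1-style aggregation used by earlier math-focused benchmarks such as ProcessBench). The paper's exact protocol is not described here, so the `Sample` structure, the function names, and the metric choice are all illustrative assumptions, not GR-Ben's actual implementation.

```python
# Illustrative sketch of a process-level error-detection metric.
# NOT the paper's actual protocol: names and the F1-style metric
# are assumptions modeled on earlier benchmarks like ProcessBench.
from dataclasses import dataclass


@dataclass
class Sample:
    steps: list[str]       # the chain-of-thought, split into steps
    gold_first_error: int  # index of first erroneous step, or -1 if none


def score(samples: list[Sample], predict_first_error) -> dict[str, float]:
    """Balance accuracy over erroneous and error-free samples,
    then combine the two with a harmonic mean (F1-style)."""
    err_hits = err_total = ok_hits = ok_total = 0
    for s in samples:
        pred = predict_first_error(s.steps)  # model under evaluation
        if s.gold_first_error >= 0:
            err_total += 1
            err_hits += int(pred == s.gold_first_error)
        else:
            ok_total += 1
            ok_hits += int(pred == -1)
    err_acc = err_hits / err_total if err_total else 0.0
    ok_acc = ok_hits / ok_total if ok_total else 0.0
    f1 = 2 * err_acc * ok_acc / (err_acc + ok_acc) if err_acc + ok_acc else 0.0
    return {"error_acc": err_acc, "correct_acc": ok_acc, "f1": f1}


if __name__ == "__main__":
    data = [Sample(["step 1", "step 2"], 1), Sample(["step 1"], -1)]
    toy_model = lambda steps: 1 if len(steps) > 1 else -1
    print(score(data, toy_model))  # {'error_acc': 1.0, 'correct_acc': 1.0, 'f1': 1.0}
```

Under a scheme like this, a model cannot score well by always flagging an error (it fails on error-free samples) or by never flagging one (it fails on erroneous samples), which is presumably why the findings above can separate weakness on knowledge-based errors from weakness on computational ones.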