PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

arXiv cs.RO / 3/24/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper argues that traditional robotic evaluation using only binary success rates fails to capture important execution qualities like progress, efficiency, and stability.
It proposes PRM-as-a-Judge, a dense evaluation approach that uses Process Reward Models to audit policy execution from trajectory videos by estimating task progress from observation sequences.
The work introduces the OPD (Outcome-Process-Diagnosis) metric framework, defining execution quality via task-aligned progress potential.
It formalizes dense evaluation with two axiomatic properties—macro-consistency (additive, path-consistent aggregation) and micro-resolution (sensitivity to fine-grained physical evolution)—and connects these to potential-based PRM judges.
Experiments on the RoboPulse diagnostic benchmark show PRM judges outperform similarity-based discriminators and general-purpose foundation-model judges, and the authors use PRM-as-a-Judge plus OPD to reveal hidden behavioral signatures and failure modes across long-horizon policy paradigms.

Abstract

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.

Composer 2: What is new and Compares with Claude Opus 4.6 & GPT-5.4

Dev.to

How UCP Breaks Your E-Commerce Tracking Stack: A Platform-by-Platform Analysis

Dev.to

AI Text Analyzer vs Asking Friends: Which Gives Better Perspective?

Dev.to

[D] Cathie wood claims ai productivity wave is starting, data shows 43% of ceos save 8+ hours weekly

Reddit r/MachineLearning

Microsoft hires top AI researchers from Allen Institute for AI for Suleyman's Superintelligence team

THE DECODER

PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

Key Points

Abstract

Related Articles

Composer 2: What is new and Compares with Claude Opus 4.6 & GPT-5.4

How UCP Breaks Your E-Commerce Tracking Stack: A Platform-by-Platform Analysis

AI Text Analyzer vs Asking Friends: Which Gives Better Perspective?

[D] Cathie wood claims ai productivity wave is starting, data shows 43% of ceos save 8+ hours weekly

Microsoft hires top AI researchers from Allen Institute for AI for Suleyman's Superintelligence team

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer