Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models
arXiv cs.CV / 3/18/2026
Key Points
- The paper identifies that vision-language process reward models (VL-PRMs) often confound perceptual accuracy with reasoning, leading to false positives and negatives in step scoring.
- It introduces Explicit Visual Premise Verification (EVPV), a lightweight interface that conditions step scoring on the reliability of the visual premises via a step-wise visual checklist and an independent constraint extractor.
- EVPV computes a scalar visual reliability signal by comparing checklist claims to extracted visual constraints, enabling reliability gating that attenuates rewards for visually dependent steps when reliability is low and preserves them when high.
- The method decouples perceptual uncertainty from logical evaluation without per-step tool calls, aiming to improve both verification and error localization.
- Empirical results on VisualProcessBench and six multimodal reasoning benchmarks show improved step-level verification and better Best-of-N reranking; controlled constraint corruption demonstrates causal gains from constraint fidelity, and code is available at the linked repository.
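The reliability-gating idea in the points above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the names (`Step`, `reliability`, `gate`) and the set-membership check standing in for checklist-vs-constraint comparison are all assumptions.

```python
# Hypothetical sketch of EVPV-style reliability gating (names and
# matching logic are assumed, not taken from the paper).
from dataclasses import dataclass, field

@dataclass
class Step:
    reward: float                  # raw PRM step score
    visually_dependent: bool       # does this step rest on visual premises?
    claims: list = field(default_factory=list)  # step-wise visual checklist

def reliability(claims, constraints):
    """Fraction of checklist claims supported by extracted visual constraints."""
    if not claims:
        return 1.0  # no visual premises -> nothing to discount
    return sum(c in constraints for c in claims) / len(claims)

def gate(step, constraints):
    """Attenuate rewards of visually dependent steps when reliability is low;
    leave purely logical steps untouched."""
    if not step.visually_dependent:
        return step.reward
    return step.reward * reliability(step.claims, constraints)

# Toy example: one of two checklist claims matches the extracted constraints.
constraints = {"two red cubes", "cylinder left of sphere"}
step = Step(reward=0.9, visually_dependent=True,
            claims=["two red cubes", "three spheres"])
print(round(gate(step, constraints), 3))  # 0.9 * (1/2) = 0.45
```

In the paper's setting the comparison would be done by a model-based constraint extractor rather than exact string matching; the sketch only shows how a scalar reliability signal can multiply into step rewards without per-step tool calls.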