Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

arXiv cs.CV / 4/24/2026


Key Points

  • The paper finds that vision-language model (VLM) evaluators—used to judge other models’ outputs in both image-to-text and text-to-image settings—are often unreliable, despite their growing real-world use.
  • It introduces perturbations targeting key failure modes (object hallucinations, spatial/compositional errors, factual grounding issues, and visual fidelity) and uses a large benchmark of more than 4,000 perturbed cases across 40 perturbation dimensions.
  • Across four prominent VLMs and multiple evaluation setups (single-answer scoring, pairwise comparison, and reference-guided methods), the evaluators often fail to detect degraded outputs, with blind spots sometimes exceeding 50%.
  • Pairwise comparison improves reliability compared with the other paradigms, but significant error-detection gaps remain, especially for fine-grained spatial/compositional problems and contradictory hallucinated content.
  • The authors release code and data and recommend caution when deploying evaluator VLMs for benchmarking or development decisions due to these reliability limitations.
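The "blind spot" rates in the points above can be read as the fraction of perturbed outputs an evaluator fails to penalize. A minimal sketch of that metric, assuming a hypothetical setup where each perturbed instance has a paired unperturbed original and the evaluator assigns each a numeric score (this is an illustration, not the paper's released code):

```python
def blind_spot_rate(original_scores, perturbed_scores):
    """Fraction of perturbed instances the evaluator fails to detect.

    A perturbation counts as detected only if the evaluator scores the
    perturbed output strictly lower than its unperturbed counterpart;
    equal or higher scores are misses (blind spots).
    """
    misses = sum(
        perturbed >= original
        for original, perturbed in zip(original_scores, perturbed_scores)
    )
    return misses / len(original_scores)
```

Under this reading, a blind-spot rate above 0.5 means the evaluator missed the injected error on more than half of the perturbed instances.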

Abstract

Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, in both image-to-text (I2T) tasks, such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4,000 perturbed instances spanning 40 perturbation dimensions, we evaluate four prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (in some cases missing more than 50% of them), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
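The pairwise comparison paradigm mentioned above asks the evaluator VLM to choose between two candidate outputs rather than score one in isolation. A minimal sketch of such a judgment, where `ask_vlm(image, prompt)` is a placeholder for any VLM API call and the exact prompt wording is an assumption, not the paper's; the query is run twice with the candidates swapped, a common control for position bias, and disagreement between runs is reported as a tie:

```python
def pairwise_judgment(ask_vlm, image, question, answer_a, answer_b):
    """Ask an evaluator VLM which of two candidate answers is better.

    Runs the comparison twice with the candidate order swapped; the
    verdict is accepted only if both runs prefer the same answer,
    otherwise the result is a tie.
    """
    prompt = (
        "Question: {q}\n"
        "Answer A: {a}\n"
        "Answer B: {b}\n"
        "Which answer better describes the image? Reply 'A' or 'B'."
    )
    first = ask_vlm(image, prompt.format(q=question, a=answer_a, b=answer_b))
    # Swap the candidates so a position-biased evaluator contradicts itself.
    second = ask_vlm(image, prompt.format(q=question, a=answer_b, b=answer_a))
    if first.strip() == "A" and second.strip() == "B":
        return "A"
    if first.strip() == "B" and second.strip() == "A":
        return "B"
    return "tie"
```

In this setup, a reliable evaluator should prefer the unperturbed answer in both orderings; an evaluator that flips with candidate order surfaces as a tie rather than a spurious win.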