Benchmarking Deflection and Hallucination in Large Vision-Language Models
arXiv cs.AI / 4/15/2026
Key Points
- The paper argues that current vision-language model benchmarks miss key behaviors in retrieval-based QA, especially cases where visual and textual evidence conflict or retrieved knowledge is incomplete.
- It introduces a dynamic data curation pipeline to keep benchmark difficulty from degrading over time as LVLMs improve and can answer more questions without retrieval.
- It proposes VLM-DeflectionBench with 2,775 samples across diverse multimodal retrieval settings to test how models handle insufficient or misleading evidence by generating deflections.
- The authors define a fine-grained evaluation protocol with four scenarios that separate parametric memorization from retrieval robustness; a rough sketch of this kind of scoring appears after this list.
- Experiments on 20 state-of-the-art LVLMs show that models often fail to deflect when evidence is noisy or misleading, underscoring the need to measure “how they behave when they don’t know.”