Don't Blink: Evidence Collapse during Multimodal Reasoning

arXiv cs.AI / 4/7/2026


Key Points

  • Reasoning-based vision-language models (VLMs) can become more confident while progressively losing visual grounding, creating “evidence-collapse” failure modes that text-only monitoring may miss.
  • Experiments across MathVista, HallusionBench, and MMMU_Pro show that attention to annotated evidence regions can drop sharply during reasoning, sometimes losing more than half of the evidence mass.
  • Under cross-dataset transfer, full-response entropy is identified as the most reliable text-only uncertainty signal, while a simple vision-augmented monitoring rule is brittle and can reduce transfer performance.
  • The paper proposes an entropy–vision interaction view that distinguishes hazardous regimes (low-entropy but visually disengaged) from more benign ones depending on task type, and demonstrates that a targeted “vision veto” can reduce selective risk by up to 1.9 percentage points at 90% coverage.
  • Overall, the findings argue for task-aware multimodal monitoring to improve safety when models face distribution shift and visual grounding may degrade during reasoning.
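The "full-response entropy" signal highlighted above can be illustrated as the mean per-token Shannon entropy over an entire generated response. The sketch below is a minimal illustration under an assumed input format (one token-to-log-probability dict per generated position); the paper's exact formulation may differ.

```python
import math

def full_response_entropy(token_logprob_dists):
    """Mean per-token Shannon entropy over a full generated response.

    token_logprob_dists: list of dicts mapping candidate tokens to
    log-probabilities, one dict per generated position (an assumed
    interface, not the paper's exact one).
    """
    entropies = []
    for dist in token_logprob_dists:
        # H = -sum_t p(t) * log p(t), with p(t) = exp(log p(t))
        h = -sum(math.exp(lp) * lp for lp in dist.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)

# Toy example: one near-certain position, one uniform over 4 tokens.
certain = {"a": math.log(0.97), "b": math.log(0.01),
           "c": math.log(0.01), "d": math.log(0.01)}
uniform = {t: math.log(0.25) for t in "abcd"}
print(round(full_response_entropy([certain, uniform]), 3))  # → 0.777
```

A low value indicates the model was confident throughout its response; the paper's point is that this text-only signal transfers across datasets better than a naive combination with vision features.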

Abstract

Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks. Using this structure, a targeted vision veto reduces selective risk by up to 1.9 percentage points at 90% coverage, while avoiding degradations where disengagement is expected. The results support task-aware multimodal monitoring for safe deployment under distribution shift.
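The targeted "vision veto" described in the abstract can be sketched as a selective-prediction rule: abstain on ordinary high-entropy outputs, and additionally abstain when a prediction is confident (low entropy) but has disengaged from annotated evidence regions on a task that requires sustained visual reference. The thresholds, the task taxonomy, and the attention score below are illustrative assumptions, not the paper's calibrated values.

```python
def should_answer(entropy, evidence_attention, task_type,
                  entropy_thresh=0.5, attention_thresh=0.2):
    """Selective-prediction rule with a targeted 'vision veto' (sketch).

    entropy: full-response entropy of the model's answer.
    evidence_attention: fraction of visual attention mass on annotated
        evidence regions (hypothetical score in [0, 1]).
    task_type: "visual_reference" or "symbolic" (assumed taxonomy).
    Returns True to emit the answer, False to abstain.
    """
    if entropy > entropy_thresh:
        return False  # ordinary uncertainty-based abstention
    if task_type == "visual_reference" and evidence_attention < attention_thresh:
        return False  # vision veto: confident but visually ungrounded
    return True       # symbolic tasks tolerate visual disengagement

print(should_answer(0.1, 0.05, "visual_reference"))  # → False (veto fires)
print(should_answer(0.1, 0.05, "symbolic"))          # → True (benign regime)
print(should_answer(0.9, 0.80, "visual_reference"))  # → False (high entropy)
```

Applying the veto only on visual-reference tasks is what lets the rule cut selective risk without penalizing symbolic tasks, where low evidence attention is expected rather than hazardous.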