Don't Blink: Evidence Collapse during Multimodal Reasoning
arXiv cs.AI / 4/7/2026
Key Points
- Reasoning-based vision-language models (VLMs) can become more confident while progressively losing visual grounding, creating “evidence-collapse” failure modes that text-only monitoring may miss.
- Experiments across MathVista, HallusionBench, and MMMU_Pro show that attention to annotated evidence regions can drop sharply during reasoning, sometimes losing more than half of the evidence mass.
- Under cross-dataset transfer, full-response entropy is identified as the most reliable text-only uncertainty signal (sketched in the first code example after this list), while a simple vision-augmented monitoring rule is brittle and can reduce transfer performance.
- The paper proposes an entropy–vision interaction view that distinguishes hazardous regimes (low entropy but visually disengaged) from more benign ones depending on task type, and demonstrates that a targeted “vision veto” (sketched in the second code example after this list) can reduce selective risk by up to 1.9 percentage points at 90% coverage.
- Overall, the findings argue for task-aware multimodal monitoring to improve safety when models face distribution shift and their use of visual evidence degrades during reasoning.
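
To make the entropy signal concrete, here is a minimal Python sketch. It assumes per-step log-probability arrays are available from the decoder and reads “full-response entropy” as the mean per-token Shannon entropy; the paper may aggregate differently, and the function name `full_response_entropy` is hypothetical.

```python
import numpy as np

def full_response_entropy(step_logprobs):
    """Average token-level Shannon entropy over an entire generated response.

    `step_logprobs` is a sequence of arrays, one per decoding step, each
    holding log-probabilities over the vocabulary. Averaging per-step
    entropies is an assumption; the paper may normalize differently.
    """
    entropies = []
    for logp in step_logprobs:
        p = np.exp(logp)
        entropies.append(-np.sum(p * logp))  # H = -sum p log p, in nats
    return float(np.mean(entropies))

# Toy usage: two decoding steps over a 4-token vocabulary.
steps = [np.log([0.7, 0.1, 0.1, 0.1]), np.log([0.25, 0.25, 0.25, 0.25])]
print(full_response_entropy(steps))  # lower values indicate a more confident response
```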
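The second sketch illustrates the “vision veto” and selective risk at fixed coverage. The thresholds, the function names (`apply_vision_veto`, `selective_risk`), and the concrete rule (veto when entropy is low but attention mass on annotated evidence regions is also low) are illustrative readings of the key points above, not the paper's exact procedure.

```python
import numpy as np

def apply_vision_veto(entropies, evidence_mass,
                      entropy_thresh=0.5, mass_thresh=0.2):
    """Flag the hazardous regime: low entropy (confident) but little
    attention mass on annotated evidence regions (visually disengaged).
    Both thresholds are placeholders, not values from the paper."""
    entropies = np.asarray(entropies, dtype=float)
    evidence_mass = np.asarray(evidence_mass, dtype=float)
    return (entropies < entropy_thresh) & (evidence_mass < mass_thresh)

def selective_risk(uncertainty, correct, coverage=0.9, vetoed=None):
    """Error rate among the `coverage` fraction of answers kept after
    ranking by ascending uncertainty; vetoed answers are treated as
    maximally uncertain, so they are abstained on first."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    if vetoed is not None:
        uncertainty = uncertainty + np.where(np.asarray(vetoed), 1e9, 0.0)
    keep = np.argsort(uncertainty)[: int(np.ceil(coverage * len(uncertainty)))]
    return float(1.0 - correct[keep].mean())

# Toy usage: answer 2 is confidently wrong and visually disengaged.
ent = [0.2, 0.8, 0.1, 0.6, 0.3]
mass = [0.5, 0.4, 0.05, 0.3, 0.6]
ok = [True, True, False, True, True]
veto = apply_vision_veto(ent, mass)
print(selective_risk(ent, ok, coverage=0.8))               # entropy-only selection
print(selective_risk(ent, ok, coverage=0.8, vetoed=veto))  # with the vision veto
```

On this toy input, the veto pushes the confidently wrong, visually disengaged answer out of the kept set (risk drops from 0.25 to 0.0 at 80% coverage), which is the mechanism behind the selective-risk improvement the paper reports.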