Don't Blink: Evidence Collapse during Multimodal Reasoning

arXiv cs.AI / 4/7/2026


Key Points

  • Reasoning-based vision-language models (VLMs) can become more confident while progressively losing visual grounding, creating “evidence-collapse” failure modes that text-only monitoring may miss.
  • Experiments across MathVista, HallusionBench, and MMMU_Pro show that attention to annotated evidence regions can drop sharply during reasoning, sometimes losing more than half of the evidence mass.
  • Under cross-dataset transfer, full-response entropy is identified as the most reliable text-only uncertainty signal, while a simple vision-augmented monitoring rule is brittle and can reduce transfer performance.
  • The paper proposes an entropy–vision interaction view that distinguishes hazardous regimes (low-entropy but visually disengaged) from more benign ones depending on task type, and demonstrates that a targeted “vision veto” can reduce selective risk by up to 1.9 percentage points at 90% coverage.
  • Overall, the findings argue for task-aware multimodal monitoring to improve safety when models face distribution shift and visual grounding may degrade during reasoning.
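The "full-response entropy" signal highlighted above can be illustrated as the mean per-token Shannon entropy over an entire generated response. The sketch below is a minimal illustration under an assumed input format (one token-to-log-probability dict per generated position); the paper's exact formulation may differ.

```python
import math

def full_response_entropy(token_logprob_dists):
    """Mean per-token Shannon entropy over a full generated response.

    token_logprob_dists: list of dicts mapping candidate tokens to
    log-probabilities, one dict per generated position (an assumed
    interface, not the paper's exact one).
    """
    entropies = []
    for dist in token_logprob_dists:
        # H = -sum_t p(t) * log p(t), with p(t) = exp(log p(t))
        h = -sum(math.exp(lp) * lp for lp in dist.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)

# Toy example: one near-certain position, one uniform over 4 tokens.
certain = {"a": math.log(0.97), "b": math.log(0.01),
           "c": math.log(0.01), "d": math.log(0.01)}
uniform = {t: math.log(0.25) for t in "abcd"}
print(round(full_response_entropy([certain, uniform]), 3))  # → 0.777
```

A low value indicates the model was confident throughout its response; the paper's point is that this text-only signal transfers across datasets better than a naive combination with vision features.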

Abstract

Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks. Using this structure, a targeted vision veto reduces selective risk by up to 1.9 percentage points at 90% coverage, while avoiding degradations where disengagement is expected. The results support task-aware multimodal monitoring for safe deployment under distribution shift.
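The targeted "vision veto" described in the abstract can be sketched as a selective-prediction rule: abstain on ordinary high-entropy outputs, and additionally abstain when a prediction is confident (low entropy) but has disengaged from annotated evidence regions on a task that requires sustained visual reference. The thresholds, the task taxonomy, and the attention score below are illustrative assumptions, not the paper's calibrated values.

```python
def should_answer(entropy, evidence_attention, task_type,
                  entropy_thresh=0.5, attention_thresh=0.2):
    """Selective-prediction rule with a targeted 'vision veto' (sketch).

    entropy: full-response entropy of the model's answer.
    evidence_attention: fraction of visual attention mass on annotated
        evidence regions (hypothetical score in [0, 1]).
    task_type: "visual_reference" or "symbolic" (assumed taxonomy).
    Returns True to emit the answer, False to abstain.
    """
    if entropy > entropy_thresh:
        return False  # ordinary uncertainty-based abstention
    if task_type == "visual_reference" and evidence_attention < attention_thresh:
        return False  # vision veto: confident but visually ungrounded
    return True       # symbolic tasks tolerate visual disengagement

print(should_answer(0.1, 0.05, "visual_reference"))  # → False (veto fires)
print(should_answer(0.1, 0.05, "symbolic"))          # → True (benign regime)
print(should_answer(0.9, 0.80, "visual_reference"))  # → False (high entropy)
```

Applying the veto only on visual-reference tasks is what lets the rule cut selective risk without penalizing symbolic tasks, where low evidence attention is expected rather than hazardous.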