Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

arXiv cs.CV / 4/3/2026


Key Points

  • The paper reports that multimodal LLMs can exhibit “visual attention inertia,” where visual attention stays largely static after early decoding steps and does not support the compositional reasoning needed for cognitive inference.
  • It argues that many existing hallucination mitigation approaches focus on perceptual hallucinations (e.g., whether an object exists or its attributes) and do not adequately address cognitive hallucinations requiring relational deduction between objects.
  • Using token-wise attention analysis, the authors identify visual inertia—persistently focused attention on semantically critical regions—as a key driver of this failure to perform inter-object relational inference.
  • They propose a training-free Inertia-aware Visual Excitation (IVE) method that dynamically selects emerging visual tokens and applies an inertia-aware penalty to reduce over-concentration and attention persistence in localized regions.
  • Experimental results indicate that IVE reduces cognitive hallucinations across multiple base MLLMs and several hallucination benchmarks.

Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for cognitive hallucinations, which require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
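The select-and-penalize mechanism described in the abstract can be sketched in a few lines. The sketch below is an illustrative interpretation, not the paper's exact formulation: it assumes the historical attention trend is tracked with an exponential moving average (EMA), scores each visual token's "emergence" as its rise above that trend, treats the EMA itself as the inertia signal, and reweights attention with hypothetical coefficients `alpha` (excitation) and `beta` (inertia penalty).

```python
import numpy as np

def ive_reweight(current_attn, ema_attn, decay=0.9, alpha=0.5, beta=0.5):
    """Inertia-aware reweighting of visual-token attention (illustrative sketch).

    current_attn: (num_visual_tokens,) attention weights at the current decoding step
    ema_attn:     (num_visual_tokens,) EMA of past attention (the "inertia" estimate)
    Returns (reweighted_attn, updated_ema).
    """
    # Emergence: how far a token's attention rises above its historical trend.
    emergence = np.maximum(current_attn - ema_attn, 0.0)
    # Inertia: persistently high historical attention marks static, over-focused tokens.
    inertia = ema_attn
    # Excite emerging tokens; penalize inertial ones (softmax over adjusted log-weights).
    logits = np.log(current_attn + 1e-9) + alpha * emergence - beta * inertia
    reweighted = np.exp(logits - logits.max())
    reweighted /= reweighted.sum()
    # Advance the historical trend for the next decoding step.
    updated_ema = decay * ema_attn + (1.0 - decay) * current_attn
    return reweighted, updated_ema
```

Applied once per decoding step, this shifts probability mass away from tokens the model has dwelt on and toward tokens whose attention is newly rising, which is the intended training-free "excitation" against attention staying where it settled.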