STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

arXiv cs.CV / 4/6/2026


Key Points

  • The paper argues that Video-LLMs produce spatiotemporal hallucinations because decoder layers contribute unevenly, with middle layers driving visual grounding and later layers handling linguistic composition, so mitigation should be layer-aware rather than applied globally.
  • It proposes STEAR, which selects high-risk decoding steps and retrieves token-conditioned visual evidence specifically from grounding-sensitive middle layers to guide correction.
  • STEAR uses the same evidence for two linked interventions: restoring missing local grounding in middle layers and creating temporally perturbed patch-level counterfactuals to challenge inconsistent reasoning in late-layer decoding.
  • Experiments on multiple Video-LLM backbones and benchmarks show STEAR reduces both spatial and temporal hallucinations while improving faithfulness, temporal consistency, and robustness.
  • The authors claim hallucination mitigation is most effective when intervening with precise evidence at the right layer, and they provide code in supplementary materials.
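The temporally perturbed patch-level counterfactuals mentioned in the third point could be sketched as below. The function name, tensor shapes, and the choice of a random frame permutation are illustrative assumptions, not the paper's actual procedure:

```python
# Hypothetical sketch: build a temporal counterfactual by permuting the
# frame axis of patch-level evidence, so that a temporally consistent
# answer should change under the perturbation. Assumes >= 2 frames.
import numpy as np

def temporal_counterfactual(patch_features, rng=None):
    """patch_features: (num_frames, num_patches, dim) visual evidence.

    Returns a copy with the frame order permuted (never the identity
    permutation), a stand-in for the paper's perturbation step.
    """
    rng = rng or np.random.default_rng()
    n = patch_features.shape[0]
    perm = rng.permutation(n)
    while np.array_equal(perm, np.arange(n)):  # reject identity shuffle
        perm = rng.permutation(n)
    return patch_features[perm]
```

A late-layer answer that stays unchanged under such a perturbation would suggest the model is ignoring temporal order, which is the inconsistency the counterfactual is meant to expose.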

Abstract

Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding relies on intervening on precise evidence at the right layer, rather than enforcing a global penalty. The code is provided in the Supplementary Material.
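As a rough illustration of the layer-aware intervention the abstract describes, the sketch below flags weakly grounded decoding steps and blends middle-layer evidence back into the final-layer representation. All names, the attention threshold, the layer range, and the blending rule are assumptions made for illustration, not the authors' implementation:

```python
# Hypothetical sketch of a layer-aware evidence intervention in the
# spirit of STEAR. Shapes, thresholds, and layer indices are invented.
import numpy as np

def select_high_risk_steps(visual_attention, threshold=0.2):
    """Flag decoding steps whose peak attention to visual tokens is weak.

    visual_attention: (num_steps, num_patches) attention from each
    generated token to video patch tokens (assumed available).
    """
    peak = visual_attention.max(axis=1)
    return np.where(peak < threshold)[0]

def middle_layer_evidence(hidden_states, step, layer_range=(8, 16)):
    """Average grounding-sensitive middle-layer states as token-conditioned
    evidence for one decoding step (stand-in for evidence selection)."""
    lo, hi = layer_range
    return hidden_states[lo:hi, step].mean(axis=0)

def intervene(hidden_states, visual_attention, alpha=0.5):
    """Restore grounding at high-risk steps only, leaving well-grounded
    steps untouched; hidden_states: (num_layers, num_steps, dim)."""
    out = hidden_states[-1].copy()
    for step in select_high_risk_steps(visual_attention):
        evidence = middle_layer_evidence(hidden_states, step)
        out[step] = (1 - alpha) * out[step] + alpha * evidence
    return out
```

The targeted nature of the update, touching only flagged steps and only with middle-layer evidence, is what distinguishes this style of intervention from a globally shared decoding penalty.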