Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

arXiv cs.CV / 4/15/2026


Key Points

  • The paper identifies a decoder-side bias in Video-LLMs where generation over-concentrates on a single “anchor frame,” leading to temporally imbalanced evidence aggregation that correlates with hallucinations.
  • This anchor-frame dominance is found to be largely input-independent and reflects persistent model-specific structural/positional tendencies.
  • To mitigate the problem, the authors propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference technique that rebalances temporal visual attention in middle-to-late decoder layers.
  • DTR improves hallucination robustness across multiple Video-LLM families while maintaining competitive video understanding performance and high inference efficiency, without changing visual encoding or using auxiliary models.

Abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
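The core idea of decoder-side temporal rebalancing can be illustrated with a small sketch. The code below is a hypothetical, simplified illustration (not the paper's actual DTR algorithm): given a decoder attention distribution over visual tokens, it aggregates attention mass per frame, flattens the frame-level distribution with a power temperature `tau` in (0, 1) to reduce the anchor frame's dominance, then rescales each frame's tokens to match the rebalanced mass while preserving the within-frame token distribution. The function name, arguments, and the temperature-flattening choice are all assumptions for illustration.

```python
import numpy as np

def rebalance_frame_attention(attn, num_frames, tokens_per_frame, tau=0.5):
    """Hypothetical sketch of frame-level attention rebalancing.

    attn: 1-D array of attention weights over visual tokens (sums to 1),
          laid out frame by frame. Not the paper's actual DTR procedure.
    """
    # Aggregate attention mass per frame.
    per_frame = attn.reshape(num_frames, tokens_per_frame)
    mass = per_frame.sum(axis=1)                      # shape: (num_frames,)
    # Flatten the frame-level distribution: a power tau in (0, 1)
    # compresses the gap between the anchor frame and the rest.
    flat = mass ** tau
    flat /= flat.sum()
    # Rescale each frame's tokens so the frame carries its new mass,
    # keeping the relative within-frame token weights unchanged.
    scale = np.where(mass > 0, flat / np.maximum(mass, 1e-12), 0.0)
    return (per_frame * scale[:, None]).reshape(-1)
```

With an anchor frame holding 70% of the attention mass over 4 frames, this flattening reduces its share substantially while keeping the frame ordering intact, which matches the paper's stated goal of letting under-attended frames contribute more to generation.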