See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment

arXiv cs.CV / 4/14/2026


Key Points

  • The paper identifies inequitable attention during decoding as a causal driver of object hallucination in multimodal LLMs: rare, small, or contextually peripheral objects receive too little grounding signal.
  • It proposes DOP-OBC, a training-free, architecture-agnostic decoding strategy that enforces more equitable attention via two mechanisms: Dominant Object Penalty (DOP) and Outlier Boost Coefficient (OBC).
  • DOP and OBC are implemented as per-row logit modulations within the causal attention mask, avoiding weight updates while maintaining autoregressive decoding behavior.
  • Experiments across image and video MLLMs show consistent hallucination reductions on CHAIR and POPE benchmarks and improved GPT-4o captioning quality across multiple evaluation dimensions.
  • The work frames fairness in attention allocation as a practical method for improving faithfulness in vision-language generation rather than only a theoretical design principle.

Abstract

Multimodal large language models (MLLMs) frequently hallucinate objects that are absent from the visual input, often because attention during decoding is disproportionately drawn to visually dominant or frequently occurring content. We observe that this inequity in attention allocation is a root cause of object hallucination: when rare, small, or contextually peripheral objects receive insufficient attention, the model fails to ground its generation in the full visual scene. We argue that every object in an image, regardless of its size, frequency, or visual salience, deserves equal representational opportunity during decoding. To this end, we propose DOP-OBC, a training-free and architecture-agnostic decoding strategy built on the principle of equitable attention. Two complementary object-aware signals work in tandem: a Dominant Object Penalty (DOP) that softly suppresses attention over-concentration on visually dominant regions, and an Outlier Boost Coefficient (OBC) that amplifies attention toward rare yet confidently detected objects. These signals are injected as per-row logit modulations within the causal attention mask, requiring no weight updates and preserving autoregressive decoding properties. Extensive experiments across image and video MLLMs demonstrate consistent reductions in object hallucination on the CHAIR and POPE benchmarks, alongside improvements in GPT-4o-assessed captioning quality across the correctness, consistency, detail, context, and temporal dimensions. DOP-OBC establishes that fairness in attention allocation is not merely a design principle but a practical and effective path toward more faithful multimodal generation.
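The mechanism described above, additive per-row modulation of attention logits inside the causal mask, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the masks flagging "dominant" and "outlier" visual-token columns, and the strengths `alpha` (DOP) and `beta` (OBC), are hypothetical stand-ins for the paper's object-aware signals.

```python
import numpy as np

def dop_obc_attention(logits, dominant_mask, outlier_mask, alpha=1.0, beta=1.0):
    """Sketch of DOP-OBC-style per-row logit modulation before softmax.

    logits        : (rows, cols) attention logits; causally masked entries are -inf
    dominant_mask : boolean over columns flagging visually dominant tokens (assumed)
    outlier_mask  : boolean over columns flagging rare, confidently detected tokens (assumed)
    alpha, beta   : hypothetical DOP / OBC strengths
    """
    mod = logits.astype(float).copy()
    mod[:, dominant_mask] -= alpha   # DOP: softly suppress dominant regions
    mod[:, outlier_mask] += beta     # OBC: boost rare but confident objects
    # Additive shifts leave -inf (causally masked) entries at -inf,
    # so autoregressive decoding behavior is preserved; no weights change.
    e = np.exp(mod - mod.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Usage: one query row over four key columns, last column causally masked.
logits = np.array([[1.0, 1.0, 1.0, -np.inf]])
weights = dop_obc_attention(
    logits,
    dominant_mask=np.array([True, False, False, False]),
    outlier_mask=np.array([False, False, True, False]),
)
```

Because the modulation is a pure additive bias on logits, it composes with any attention implementation and requires no retraining, which is what makes the strategy training-free and architecture-agnostic.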