Revealing Multi-View Hallucination in Large Vision-Language Models

arXiv cs.CV / 3/26/2026


Key Points

  • The paper identifies a failure mode in large vision-language models used with multi-view inputs, where the model mismatches visual evidence across instances or viewpoints, which the authors call “multi-view hallucination.”
  • It introduces MVH-Bench with 4.8k question-answer pairs to systematically measure two hallucination types: cross-instance hallucination and cross-view hallucination.
  • Experiments show that recent LVLMs have difficulty correctly linking the right visual evidence to the corresponding instance/viewpoint.
  • The authors propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding method that reduces visual interference by generating negative logits via attention masking.
  • RSCD improves results on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision, with gains of up to 21.1 and 34.6 points, respectively, over existing mitigation approaches.

Abstract

Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
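The core decoding step can be illustrated in miniature. The paper does not specify RSCD's exact formula here; the sketch below assumes the standard contrastive-decoding combination used by related training-free methods: one forward pass with full attention over the multi-view input yields "positive" logits, a second pass with attention to the queried view masked out yields "negative" logits capturing visual interference, and the two are contrasted before sampling. All names and the alpha parameter are illustrative, not from the paper.

```python
# Illustrative sketch of a generic contrastive-decoding step, NOT the
# paper's RSCD implementation. Assumed setup: `pos` are next-token
# logits from a full-attention pass over all views; `neg` are logits
# from a pass where attention to the reference view is masked, so they
# reflect interference from the other instances/viewpoints.

import math

def contrastive_logits(pos, neg, alpha=1.0):
    """Amplify evidence the full context supports but the masked
    context does not: (1 + alpha) * pos - alpha * neg."""
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos, neg)]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token vocabulary. Token 2 scores high in both passes
# (interference from other views); token 1 is supported only when the
# model can attend to the queried view.
pos = [0.1, 2.0, 2.2, 0.0]  # full attention over all views
neg = [0.1, 0.2, 2.1, 0.0]  # queried view masked out

probs = softmax(contrastive_logits(pos, neg, alpha=1.0))
best = max(range(len(probs)), key=probs.__getitem__)
# Plain argmax over `pos` would pick token 2; the contrastive step
# shifts the choice to token 1, the view-grounded answer.
```

The design point is that no retraining is needed: the negative pass reuses the same model with a modified attention mask, and the subtraction suppresses tokens whose probability survives even when the relevant view is invisible.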