Online Self-Calibration Against Hallucination in Vision-Language Models

arXiv cs.CV / 5/4/2026


Key Points

  • The paper addresses hallucinations in large vision-language models (LVLMs), where the model can invent visual details not present in the input image.
  • It argues that existing offline preference-alignment approaches can suffer a “supervision–perception mismatch,” causing student models to learn to guess details they cannot truly perceive.
  • The authors identify a “generative–discriminative gap” in LVLMs, noting that these models perform better at discriminative verification than at open-ended generation, and use this to enable more reliable self-supervision.
  • They propose OSCAR, an online self-calibration framework that uses Monte Carlo Tree Search plus a dual-granularity reward to build preference data and then iteratively refines the model via Direct Preference Optimization (a simplified sketch of this loop follows the list).
  • Experiments show OSCAR delivers state-of-the-art results on hallucination benchmarks and also boosts broader multimodal capabilities.
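
To make the self-calibration idea concrete, the sketch below illustrates one plausible reading of the data-construction step: sample several descriptions, let the model verify each atomic claim discriminatively, score candidates by verification confidence, and pair the best and worst as a preference example. The generate/verify/split_claims helpers are hypothetical placeholders, and the search-free scoring here is a simplification of OSCAR's MCTS and dual-granularity reward, not the authors' code.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    # Hypothetical interfaces standing in for an LVLM's two modes:
    # open-ended captioning and yes/no claim verification.
    GenerateFn = Callable[[str], List[str]]   # image path -> sampled descriptions
    VerifyFn = Callable[[str, str], float]    # (image path, claim) -> P("claim is visible")
    SplitFn = Callable[[str], List[str]]      # description -> atomic claims

    @dataclass
    class PreferencePair:
        chosen: str
        rejected: str

    def build_preference_pair(image: str,
                              generate: GenerateFn,
                              verify: VerifyFn,
                              split_claims: SplitFn) -> PreferencePair:
        """Score each sampled description by discriminative self-verification of
        its atomic claims, then pair the highest- and lowest-scoring candidates
        as a preference example. A simplified stand-in for OSCAR's MCTS search
        and dual-granularity (claim- and response-level) reward."""
        scored: List[Tuple[float, str]] = []
        for cand in generate(image):
            claims = split_claims(cand)
            # Response-level score: mean verification confidence over claims.
            score = sum(verify(image, c) for c in claims) / max(len(claims), 1)
            scored.append((score, cand))
        scored.sort(key=lambda s: s[0], reverse=True)
        return PreferencePair(chosen=scored[0][1], rejected=scored[-1][1])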

Abstract

Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
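
The constructed preference pairs are consumed by Direct Preference Optimization. For reference, a minimal PyTorch version of the standard DPO loss is shown below; this is the generic objective rather than OSCAR's specific training recipe, and beta = 0.1 is a conventional default, not a value from the paper.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """Standard DPO objective: increase the policy's log-probability margin
        between the chosen and rejected responses relative to a frozen
        reference model. Log-probabilities are summed over response tokens."""
        chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()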
