VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

arXiv cs.CL / 4/13/2026


Key Points

  • The paper argues that large vision-language models often produce hallucinations or incorrect answers with high confidence, and that existing “single-score” confidence calibration methods for text-only LLMs do not match the LVLM error structure.
  • It proposes VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual confidence (perception/grounding) and reasoning confidence (answer generation given perception).
  • To supervise visual confidence without ground-truth perception labels, the method introduces an intrinsic visual certainty estimate that combines visual grounding, measured as the KL divergence between predictions on clean and perturbed images, with internal certainty measured by token entropy.
  • It further uses token-level advantage reweighting guided by visual certainty to reduce optimization on ungrounded hallucination tokens while preserving properly grounded perception.
  • Experiments across thirteen benchmarks show improved confidence calibration and higher visual reasoning accuracy, with generalization to out-of-distribution benchmarks across different model scales and architectures.
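The intrinsic visual certainty estimate above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact formulation: the mixing coefficient `alpha`, the exponential mappings, and all function names are assumptions chosen to show how a KL-based grounding signal and a token-entropy signal might be combined into one score.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete token distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def token_entropy(probs, eps=1e-12):
    """Shannon entropy (nats) of a token distribution."""
    p = np.asarray(probs, dtype=float) + eps
    p = p / p.sum()
    return float(-np.sum(p * np.log(p)))

def visual_certainty(p_clean, p_perturbed_list, alpha=0.5):
    """Illustrative combination of (i) grounding stability under image
    perturbations and (ii) internal certainty; both mapped to (0, 1]
    via exp(-x) and mixed with a hypothetical weight `alpha`."""
    # Grounding: average KL shift between clean and perturbed predictions.
    kl = np.mean([kl_divergence(p_clean, q) for q in p_perturbed_list])
    grounding = np.exp(-kl)                       # high when stable
    certainty = np.exp(-token_entropy(p_clean))   # high when peaked
    return alpha * grounding + (1 - alpha) * certainty
```

A peaked prediction that stays stable under perturbation scores high; a flat prediction that drifts when the image is perturbed scores low, matching the intuition that ungrounded tokens carry low visual certainty.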

Abstract

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
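The token-level advantage reweighting described in the abstract can be sketched as follows. This is a minimal hypothetical illustration, assuming per-token policy-gradient advantages and per-token visual certainty scores are already available; the min-max normalization and the `floor` parameter are assumptions, not the paper's specified rule.

```python
import numpy as np

def reweight_token_advantages(advantages, certainties, floor=0.1):
    """Scale per-token RL advantages by visual certainty so that
    low-certainty (likely ungrounded) tokens contribute less to the
    policy update, while well-grounded perception tokens are preserved.
    `floor` is a hypothetical lower bound keeping a small gradient
    signal on every token."""
    adv = np.asarray(advantages, dtype=float)
    cert = np.asarray(certainties, dtype=float)
    # Normalize certainties to [0, 1], then clamp below by `floor`.
    cert = (cert - cert.min()) / (cert.max() - cert.min() + 1e-12)
    weights = np.maximum(cert, floor)
    return adv * weights
```

Under this sketch, a token with near-zero visual certainty has its advantage shrunk toward the floor, so the optimizer spends little capacity reinforcing it, which is one plausible way to suppress ungrounded hallucination tokens without zeroing out their gradient entirely.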