VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
arXiv cs.CL / 4/13/2026
Key Points
- The paper argues that large vision-language models often produce hallucinations or incorrect answers with high confidence, and that existing “single-score” confidence calibration methods for text-only LLMs do not match the LVLM error structure.
- It proposes VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual confidence (perception/grounding) and reasoning confidence (answer generation given perception).
- For visual supervision without ground-truth perception labels, the method introduces an intrinsic visual certainty estimate that combines grounding uncertainty under image perturbations (measured via KL divergence) with internal token entropy (see the first sketch after this list).
- It further uses token-level advantage reweighting guided by visual certainty, reducing optimization pressure on ungrounded hallucination tokens while preserving properly grounded perception (see the second sketch after this list).
- Experiments across thirteen benchmarks show improved confidence calibration and higher visual reasoning accuracy, with generalization to out-of-distribution benchmarks across different model scales and architectures.
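To make the intrinsic visual certainty idea concrete, here is a minimal PyTorch sketch. It assumes access to the model's per-token logits for the same answer scored once against the original image and once against a perturbed image; the KL direction, the `alpha` weighting, and the `exp(-uncertainty)` mapping are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def intrinsic_visual_certainty(logits_orig, logits_perturbed, alpha=0.5):
    """Sketch of a per-token visual certainty score.

    logits_orig:      [T, V] logits for the answer tokens given the original image
    logits_perturbed: [T, V] logits for the same tokens given a perturbed image
    alpha:            assumed weight balancing the two uncertainty signals
    """
    p = F.softmax(logits_orig, dim=-1)
    q = F.softmax(logits_perturbed, dim=-1)

    # Sensitivity to image perturbation: per-token KL(p || q).
    kl = (p * (p.clamp_min(1e-9).log() - q.clamp_min(1e-9).log())).sum(-1)

    # Internal uncertainty: per-token entropy of the original distribution.
    entropy = -(p * p.clamp_min(1e-9).log()).sum(-1)

    # Higher KL or entropy means lower visual certainty; map into (0, 1].
    uncertainty = alpha * kl + (1 - alpha) * entropy
    return torch.exp(-uncertainty)
```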
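The token-level advantage reweighting point can likewise be read as scaling per-token policy-gradient advantages by the certainty scores above. The `reweight_advantages` helper and its `floor` parameter below are hypothetical placeholders, not the paper's exact scheme.

```python
import torch

def reweight_advantages(advantages, visual_certainty, floor=0.1):
    """Down-weight advantages on tokens with low visual certainty.

    advantages:       [T] per-token advantage estimates from the RL objective
    visual_certainty: [T] per-token certainty scores in (0, 1]
    floor:            assumed lower bound so no token is zeroed out entirely
    """
    weights = visual_certainty.clamp_min(floor)
    return advantages * weights
```

Under this reading, tokens the model cannot ground in the image contribute little to the update, while well-grounded perception tokens keep close to their original weight.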