The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models
arXiv cs.LG / 3/18/2026
Key Points
- The authors show that extended reasoning through chain-of-thought prompting in vision-language models reduces the reliability of uncertainty estimates, even when it improves task accuracy.
- The primary mechanism is implicit answer conditioning: as reasoning traces converge on a conclusion, token probabilities reflect consistency with the model's own reasoning rather than true uncertainty about correctness, leading to overconfidence.
- In contrast, agreement-based consistency remains robust under reasoning and often improves, making it a practical uncertainty estimator for reasoning-enabled VLMs (see the sketch after this list).
- These findings have important implications for deploying VLMs in high-stakes settings and for designing reliable uncertainty quantification methods in such systems.
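The agreement signal mentioned above can be estimated without access to token probabilities: sample several independent chain-of-thought runs, extract each final answer, and use the share of runs agreeing with the majority answer as the confidence score. The following is a minimal sketch of that idea in Python; the `sample_cot_answer` callable and the answer-normalization step are assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
from typing import Callable, List, Tuple

def agreement_confidence(
    sample_cot_answer: Callable[[], str],  # hypothetical: one CoT run -> final answer string
    num_samples: int = 8,
) -> Tuple[str, float]:
    """Estimate confidence as agreement across independent chain-of-thought samples.

    Returns the majority answer and the fraction of samples that agree with it.
    Illustrative sketch only, not the paper's exact estimator.
    """
    answers: List[str] = [sample_cot_answer() for _ in range(num_samples)]
    counts = Counter(a.strip().lower() for a in answers)  # normalize before comparing
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_answer, majority_count / num_samples

# Toy usage with canned outputs standing in for a VLM's sampled CoT answers.
if __name__ == "__main__":
    canned = iter(["cat", "cat", "dog", "cat", "cat", "cat", "cat", "dog"])
    answer, confidence = agreement_confidence(lambda: next(canned), num_samples=8)
    print(answer, confidence)  # -> "cat" 0.75
```

Because this score depends only on the sampled final answers, it is unaffected by the probability shifts that the reasoning trace induces on individual answer tokens, which is why it stays better calibrated than raw token-probability confidence.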