Closing the Confidence-Faithfulness Gap in Large Language Models

arXiv cs.CL / 3/27/2026


Key Points

  • The paper analyzes why LLMs’ verbalized confidence scores are poorly calibrated and proposes a mechanistic explanation using linear probes and contrastive activation addition steering.
  • It finds that calibration (accuracy-related) signals and verbalized confidence are each encoded in a linearly decodable way, yet the two directions are orthogonal to each other across multiple open-weight models and datasets (see the probe sketch after this list).
  • When prompts require the model to both reason and output a confidence score, the reasoning process can shift or disrupt the internal direction for confidence, worsening miscalibration; the authors term this the "Reasoning Contamination Effect."
  • Using these findings, the authors propose a two-stage adaptive steering pipeline that leverages the model’s internal accuracy estimate to steer verbalized confidence, substantially improving confidence-to-accuracy alignment across evaluated models.
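
To make the probing setup concrete, here is a minimal sketch (not the authors' released code) of fitting one linear probe for the accuracy signal and one for verbalized confidence, then comparing the probe directions. The array shapes and the random placeholder data are assumptions made so the sketch is self-contained and runnable.

```python
# Minimal sketch of the probing analysis (illustrative, not the authors'
# released code). `acts` stands in for cached residual-stream activations
# at one layer; here it is random placeholder data so the sketch runs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, d_model = 2000, 512                    # placeholder sizes
acts = rng.normal(size=(n_examples, d_model))
is_correct = rng.integers(0, 2, size=n_examples)   # calibration label
high_conf = rng.integers(0, 2, size=n_examples)    # verbalized-confidence label

# One linear probe per signal.
probe_acc = LogisticRegression(max_iter=1000).fit(acts, is_correct)
probe_conf = LogisticRegression(max_iter=1000).fit(acts, high_conf)

# Compare the probe weight directions; a cosine near zero is the
# orthogonality pattern the paper reports.
w_acc = probe_acc.coef_.ravel()
w_conf = probe_conf.coef_.ravel()
cos = w_acc @ w_conf / (np.linalg.norm(w_acc) * np.linalg.norm(w_conf))
print(f"cosine(accuracy direction, confidence direction) = {cos:.3f}")
```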

Abstract

Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers the verbalized output to match it, substantially improving calibration alignment across all evaluated models.
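
Below is one hedged sketch of how such a two-stage pipeline could be wired up. The helper names (caa_direction, adaptive_steering), the logistic probe readout, and the linear scaling rule are illustrative assumptions, not the paper's published implementation.

```python
# Hedged sketch of a two-stage adaptive steering pass. The helper names,
# the logistic probe readout, and the linear scaling rule are assumptions
# made for illustration; the paper's implementation may differ.
import numpy as np

def caa_direction(acts_high, acts_low):
    """Contrastive activation addition (CAA) vector: the difference of
    mean activations between high- and low-confidence prompt sets."""
    return acts_high.mean(axis=0) - acts_low.mean(axis=0)

def adaptive_steering(hidden, w_acc, v_conf, alpha=4.0):
    """Stage 1: read the internal accuracy estimate with the linear
    probe weights w_acc. Stage 2: steer along the verbalized-confidence
    direction v_conf, scaled so stated confidence tracks that estimate."""
    p_correct = 1.0 / (1.0 + np.exp(-(hidden @ w_acc)))  # probe readout
    scale = alpha * (p_correct - 0.5)   # push up if confident, down if not
    return hidden + scale * v_conf / np.linalg.norm(v_conf)

# Toy usage with random stand-ins for real activations.
rng = np.random.default_rng(0)
d = 512
v_conf = caa_direction(rng.normal(1.0, 1.0, (64, d)),
                       rng.normal(-1.0, 1.0, (64, d)))
steered = adaptive_steering(rng.normal(size=d), rng.normal(size=d), v_conf)
```

In a real model, a pass like this would presumably be attached as a forward hook at the layer where the probes were trained, so the steering is applied during generation of the confidence token.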