Closing the Confidence-Faithfulness Gap in Large Language Models
arXiv cs.CL / 3/27/2026
Key Points
- The paper analyzes why LLMs’ verbalized confidence scores are poorly calibrated and proposes a mechanistic explanation using linear probes and contrastive activation addition steering.
- It finds that accuracy-related (calibration) signals and verbalized confidence are each linearly decodable from the model's hidden activations, yet the two directions are nearly orthogonal to each other, a result that holds across multiple open-weight models and datasets.
- When a prompt requires the model to both reason and output a confidence score, the reasoning process can shift or disrupt the internal confidence direction, worsening miscalibration; the authors call this the "Reasoning Contamination Effect."
- Building on these findings, the authors propose a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and uses it to steer the verbalized confidence, substantially improving confidence-to-accuracy alignment across the evaluated models (see the sketch after this list).
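A minimal sketch of the two ingredients summarized above, assuming standard tooling rather than the authors' code: a linear probe (logistic regression on hidden states) that recovers an internal accuracy estimate, and a contrastive activation addition (CAA) direction for verbalized confidence, applied adaptively so the added confidence tracks the probe's estimate. All activations below are synthetic placeholders; the hidden size, scaling factor `alpha`, and the `adaptive_steer` helper are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512    # hidden size (assumed)
n = 2000   # number of probe training examples (synthetic)

# Synthetic hidden states standing in for a real model's residual-stream activations.
# Two orthogonal directions play the roles of the "accuracy" and "confidence" signals.
acc_dir = rng.normal(size=d)
acc_dir /= np.linalg.norm(acc_dir)
conf_dir = rng.normal(size=d)
conf_dir -= (conf_dir @ acc_dir) * acc_dir
conf_dir /= np.linalg.norm(conf_dir)

H = rng.normal(size=(n, d))
is_correct = (H @ acc_dir + 0.3 * rng.normal(size=n)) > 0   # correctness labels

# Stage 1: linear probe that reads an internal accuracy estimate from hidden states.
probe = LogisticRegression(max_iter=1000).fit(H, is_correct)

# CAA direction for verbalized confidence: mean activation under "high confidence"
# prompts minus mean activation under "low confidence" prompts (both synthetic here).
high_conf_acts = H[: n // 2] + 2.0 * conf_dir
low_conf_acts = H[n // 2 :] - 2.0 * conf_dir
steer_vec = high_conf_acts.mean(axis=0) - low_conf_acts.mean(axis=0)
steer_vec /= np.linalg.norm(steer_vec)

# Orthogonality check echoing the paper's finding: the probe's accuracy direction
# and the confidence steering direction should be close to perpendicular.
probe_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print(f"cos(accuracy probe dir, confidence steer dir) = {probe_dir @ steer_vec:+.3f}")

# Stage 2: adaptive steering (illustrative). Read the probe's accuracy estimate for
# a new hidden state, then add the confidence direction scaled by that estimate, so
# verbalized confidence is pushed up when the model is likely right and down when
# it is likely wrong.
def adaptive_steer(h, alpha=4.0):
    p_correct = probe.predict_proba(h[None, :])[0, 1]   # internal accuracy estimate
    return h + alpha * (p_correct - 0.5) * steer_vec

h_new = rng.normal(size=d)
print("estimated P(correct):", round(probe.predict_proba(h_new[None, :])[0, 1], 3))
h_steered = adaptive_steer(h_new)
```

The printed cosine similarity illustrates the orthogonality claim in the second key point: a near-zero value means the accuracy probe and the confidence steering vector occupy largely independent directions, which is why adjusting verbalized confidence alone does not automatically fix calibration and an accuracy-conditioned second stage is needed.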