How Do LLMs Compute Verbal Confidence?

arXiv cs.CL / 4/1/2026


Key Points

  • The paper investigates how LLMs generate “verbal confidence” tokens, addressing both *when* confidence is computed (cached during generation vs computed just-in-time) and *what* it represents internally (simple token log-probabilities vs richer representations of answer quality).
  • Using experiments on Gemma 3 27B and Qwen 2.5 7B—including activation steering, patching, noising, and swap tests—the authors find evidence that confidence is computed during answer generation and cached for later retrieval.
  • The study shows that confidence-related information emerges first in answer-adjacent hidden states, is cached at the first post-answer position, and is then retrieved when the model produces the verbalized confidence.
  • Attention blocking experiments pinpoint the information flow, indicating that confidence is gathered from answer tokens rather than being independently reconstructed at the verbalization site.
  • Linear probing and variance partitioning reveal that cached representations explain substantial variance in verbal confidence beyond token log-probabilities, supporting the view that verbal confidence is a sophisticated self-evaluation mechanism with implications for calibration and LLM metacognition.
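To make the probing and variance-partitioning idea concrete, here is a minimal toy sketch: ridge-regression probes are fit on synthetic stand-ins for the cached hidden states and for answer-token log-probabilities, and the gain in R² from adding hidden states on top of log-probabilities is read off as the variance unique to the cached representation. All shapes, data, and the `r2` helper are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper, `hidden` would be per-question
# activations from the first post-answer position, and `confidence` the
# model's stated verbal confidence.
n, d = 500, 64
hidden = rng.normal(size=(n, d))      # cached post-answer activations
logprob = rng.normal(size=(n, 1))     # mean answer-token log-probability
# In this toy setup, verbal confidence depends on both signals plus noise.
confidence = (0.3 * logprob[:, 0]
              + (hidden @ rng.normal(size=d)) * 0.1
              + rng.normal(size=n) * 0.1)

def r2(X, y, lam=1e-3):
    """In-sample R^2 of a ridge-regression probe fit on (X, y)."""
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    resid = y - X @ w
    return 1 - resid.var() / y.var()

r2_logprob = r2(logprob, confidence)
r2_hidden = r2(hidden, confidence)
r2_both = r2(np.hstack([logprob, hidden]), confidence)

# Variance in verbal confidence uniquely explained by the cached
# representation, beyond what token log-probabilities account for:
unique_hidden = r2_both - r2_logprob
print(f"log-prob only: {r2_logprob:.3f}  hidden only: {r2_hidden:.3f}  "
      f"unique to hidden: {unique_hidden:.3f}")
```

A positive `unique_hidden` is the pattern the paper reports: cached hidden states predict verbal confidence beyond a simple fluency (log-probability) readout.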

Abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
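The attention-blocking intervention can be sketched in isolation: forbid the query at the confidence-verbalization position from attending to answer-token keys, so any confidence signal must arrive via intermediate (e.g. post-answer) positions. The token layout, dimensions, and single-head attention below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d, T = 16, 9
# Hypothetical layout: question tokens 0-3, answer tokens 4-6,
# first post-answer position 7, confidence token 8.
answer_pos = [4, 5, 6]
conf_pos = 8

Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
scores = Q @ K.T / np.sqrt(d)

# Causal mask: each position attends only to itself and earlier tokens.
causal = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(causal, -np.inf, scores)
attn = softmax(scores)

# Attention blocking: zero out the confidence token's attention to the
# answer tokens by masking those scores before the softmax.
blocked = scores.copy()
blocked[conf_pos, answer_pos] = -np.inf
attn_blocked = softmax(blocked)

print("mass on answer tokens (normal): ", attn[conf_pos, answer_pos].sum())
print("mass on answer tokens (blocked):", attn_blocked[conf_pos, answer_pos].sum())
```

If verbal confidence survives this block (because it was cached at the post-answer position) but breaks when the cache position is also blocked, that is the cached-retrieval signature the authors describe.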