How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

arXiv cs.LG / 4/27/2026


Key Points

  • The paper explains self-error detection and correction in LLMs using a second-order confidence framework, where an evaluative signal can disagree with the chosen response.
  • It tests whether the previously observed post-answer newline (PANL) confidence representation does more than drive verbal confidence: specifically, whether it predicts error detection and self-correction.
  • Results from a verify-then-correct paradigm show that verbal confidence predicts error detection better than token log-probabilities (supporting a second-order, not first-order, account).
  • PANL activations further improve error detection beyond verbal confidence and predict which specific errors the model can correct (a minimal probing sketch follows this list); causal interventions show that PANL signals rescue error-detection behavior when answer information is corrupted.
  • The findings replicate across two model families (Gemma 3 27B, Qwen 2.5 7B) and two tasks (TriviaQA, MNLI), suggesting LLMs implement an internal second-order confidence architecture that captures both likelihood of wrongness and fixability.
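
To make the probing logic behind these claims concrete, here is a minimal sketch, assuming you have cached hidden activations at the PANL token, elicited verbal confidences, answer token log-probabilities, and binary labels for whether the model flagged its own answer as wrong. All arrays below are random stand-ins, and the AUC comparison mirrors the structure of the paper's claims rather than reproducing the authors' pipeline.

```python
# Sketch only: compare a linear probe on PANL-token activations against
# behavioural baselines (verbal confidence, token log-probability) for
# predicting error detection. Every array here is a random placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 512                         # examples x hidden size (illustrative)
panl_acts = rng.normal(size=(n, d))      # stand-in for PANL-token activations
verbal_conf = rng.uniform(size=(n, 1))   # stand-in for elicited confidence
logprob = rng.normal(size=(n, 1))        # stand-in for answer log-probability
detected = rng.integers(0, 2, size=n)    # 1 = model flagged its answer as wrong

(acts_tr, acts_te, conf_tr, conf_te,
 lp_tr, lp_te, y_tr, y_te) = train_test_split(
    panl_acts, verbal_conf, logprob, detected, test_size=0.3, random_state=0)

for name, x_tr, x_te in [("PANL probe", acts_tr, acts_te),
                         ("verbal confidence", conf_tr, conf_te),
                         ("token log-prob", lp_tr, lp_te)]:
    clf = LogisticRegression(max_iter=2000).fit(x_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1])
    print(f"{name:18s} AUC: {auc:.3f}")
```

On real data, the paper's claim corresponds to the PANL probe scoring highest, verbal confidence second, and raw log-probabilities last.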

Abstract

Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at the token immediately following the answer (the post-answer newline, PANL) that causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct, where all behavioural signals fail. Causal interventions confirm that PANL signals rescue error-detection behavior when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
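
The verify-then-correct paradigm itself is straightforward to picture. The sketch below shows one plausible three-step loop (answer, verify, correct only on detected errors); the prompt wording, the generate() callable, and the verdict parsing are assumptions for exposition, not the paper's exact protocol.

```python
# Illustrative verify-then-correct loop. The second pass acts as the
# evaluative check that may disagree with the committed answer; correction
# is attempted only when that check flags an error.
from typing import Callable

def verify_then_correct(generate: Callable[[str], str], question: str) -> dict:
    # Step 1: answer. The model commits to a response.
    answer = generate(f"Q: {question}\nA:").strip()

    # Step 2: verify. A separate pass judges the committed answer.
    verdict = generate(
        f"Q: {question}\nA: {answer}\n"
        "Is the answer above correct? Reply 'correct' or 'incorrect'."
    )
    result = {"answer": answer, "detected_error": "incorrect" in verdict.lower()}

    # Step 3: correct, but only when an error was detected.
    if result["detected_error"]:
        result["revision"] = generate(
            f"Q: {question}\nYour previous answer ('{answer}') was wrong. "
            "Give a corrected answer.\nA:"
        ).strip()
    return result

# Usage with a stub in place of a real model call:
stub = lambda prompt: "incorrect" if "correct?" in prompt else "42"
print(verify_then_correct(stub, "What is six times seven?"))
```

Separating the verify pass from generation is what lets the paper ask its central question: whether the detection signal is merely the generation signal re-read (first-order) or a partially independent evaluation (second-order).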