Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

arXiv cs.LG / April 23, 2026


Key Points

  • The paper studies whether a large language model’s output uncertainty and its actual correctness are controlled by the same internal mechanisms or by different feature sets.
  • It proposes a 2×2 correctness-confidence partitioning framework and uses sparse autoencoders to isolate features linked independently to uncertainty and incorrectness.
  • Experiments on Llama-3.1-8B and Gemma-2-9B reveal distinct feature populations: “pure uncertainty” features are crucial for accuracy, while “pure incorrectness” features are largely inert when suppressed.
  • “Confounded” features that encode both uncertainty and incorrectness are shown to harm output quality; suppressing them improves accuracy by 1.1% and reduces entropy by 75% across ARC-Challenge and RACE.
  • The authors also show that a tiny set of confounded features (3 from a single mid-network layer) can predict correctness with AUROC ~0.79, enabling selective abstention that boosts accuracy from 62% to 81% at 53% coverage.
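The 2×2 partitioning described above can be sketched as a simple bucketing of predictions along the correctness and confidence axes. This is an illustrative sketch, not the paper's implementation: the confidence threshold, field names, and example data are all assumptions.

```python
# Hypothetical sketch of the 2x2 correctness-confidence partitioning:
# each prediction is bucketed by whether it was correct and whether the
# model's confidence (e.g. derived from output entropy) exceeded a
# threshold. Threshold and field names are illustrative assumptions.
from collections import Counter

def partition_2x2(examples, conf_threshold=0.5):
    """Bucket predictions into the four correctness/confidence quadrants."""
    buckets = Counter()
    for ex in examples:
        correct = "correct" if ex["is_correct"] else "incorrect"
        confident = "confident" if ex["confidence"] >= conf_threshold else "uncertain"
        buckets[(correct, confident)] += 1
    return buckets

examples = [
    {"is_correct": True,  "confidence": 0.9},   # confident and correct
    {"is_correct": True,  "confidence": 0.3},   # uncertain yet correct
    {"is_correct": False, "confidence": 0.8},   # confident yet wrong
    {"is_correct": False, "confidence": 0.2},   # uncertain and incorrect
]
print(partition_2x2(examples))
```

The off-diagonal quadrants (uncertain-yet-correct, confident-yet-wrong) are what let the paper separate features tied to uncertainty from features tied to incorrectness.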

Abstract

Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2×2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
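The selective-abstention result can be illustrated with a minimal sketch: given an assumed per-example score for predicted correctness (in the paper, derived from the activations of three confounded SAE features), the model answers only on the highest-scoring fraction of inputs and abstains on the rest. The helper, scores, and labels below are hypothetical, not the paper's data.

```python
# Minimal sketch of selective abstention under an assumed correctness
# predictor. Answering only on the top-scoring fraction ("coverage")
# of examples trades coverage for higher accuracy on what is answered.

def accuracy_at_coverage(scores, is_correct, coverage):
    """Answer on the top `coverage` fraction of examples ranked by score."""
    ranked = sorted(zip(scores, is_correct), key=lambda p: p[0], reverse=True)
    k = max(1, int(round(coverage * len(ranked))))
    kept = ranked[:k]
    return sum(c for _, c in kept) / k

# Illustrative predicted-correctness scores and ground-truth labels.
scores     = [0.95, 0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
is_correct = [1,    1,   1,   0,   1,   0,   0,   0]

full = sum(is_correct) / len(is_correct)                   # answer everything
selective = accuracy_at_coverage(scores, is_correct, 0.5)  # answer top half
print(full, selective)  # selective accuracy exceeds full-coverage accuracy
```

This mirrors the reported effect in miniature: restricting answers to the cases the correctness predictor trusts raises accuracy on the answered subset, at the cost of abstaining on the remainder.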