Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
arXiv cs.LG / 4/23/2026
Key Points
- The paper studies whether a large language model’s output uncertainty and its actual correctness are driven by the same internal features or by distinct feature sets.
- It proposes a 2×2 correctness-confidence partitioning framework and uses sparse autoencoders to isolate features linked independently to uncertainty and incorrectness.
- Experiments on Llama-3.1-8B and Gemma-2-9B find distinct populations: “pure uncertainty” features are crucial for accuracy, while “pure incorrectness” features are largely inert when suppressed.
- “Confounded” features that encode both uncertainty and incorrectness are shown to harm output quality; suppressing them improves accuracy by 1.1% and reduces entropy by 75% across ARC-Challenge and RACE.
- The authors also show that just three confounded features from a single mid-network layer suffice to predict correctness with AUROC ~0.79, enabling selective abstention that boosts accuracy from 62% to 81% at 53% coverage.
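The selective-abstention result above can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's implementation: it assumes a probe (e.g., logistic regression over a few SAE feature activations) that outputs a per-question score for P(correct), and the model answers only when that score clears a threshold. All names and numbers here are illustrative.

```python
import numpy as np

def selective_abstention(scores, is_correct, threshold):
    """Answer only when the probe score exceeds `threshold`; abstain otherwise.

    scores:     per-question probe scores approximating P(correct)
                (assumed to come from a probe over SAE feature activations).
    is_correct: ground-truth correctness of each model answer.
    Returns (coverage, selective_accuracy): the fraction of questions
    answered, and accuracy on that answered subset.
    """
    answered = scores >= threshold
    coverage = answered.mean()
    if coverage == 0:
        return 0.0, float("nan")
    return coverage, is_correct[answered].mean()

# Toy demonstration with synthetic scores (not the paper's data).
rng = np.random.default_rng(0)
n = 1000
is_correct = rng.random(n) < 0.62            # base accuracy ~62%, as in the paper
# A probe with some signal: correct answers tend to score higher.
scores = np.clip(0.5 + 0.25 * (is_correct - 0.5)
                 + rng.normal(0.0, 0.15, n), 0.0, 1.0)

cov, acc = selective_abstention(scores, is_correct, threshold=0.55)
```

The trade-off is the usual one for selective prediction: raising the threshold lowers coverage but raises accuracy on the questions the model still answers.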