Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
arXiv cs.AI · April 27, 2026
Key Points
- The paper evaluates whether seven 3–9B open-weight, instruction-tuned LLMs generate verbalized confidence that satisfies minimal psychometric validity criteria for item-level Type-2 discrimination under low-effort elicitation (greedy decoding, minimal numeric querying).
- A pre-registered study on 524 TriviaQA items (8,384 deterministic trials) classified all seven models as “Invalid” for numeric confidence, with roughly 91.7% of trials at the confidence ceiling, indicating that the reported confidence did not meaningfully discriminate among items.
- Switching to a categorical (10-class) confidence elicitation did not restore validity; instead, it degraded task accuracy to below 5% in six of the seven models.
- Token-level log-probabilities did not predict verbalized confidence (cross-validated R² < 0.01), and in one reasoning-distilled model longer reasoning traces were strongly negatively correlated with confidence (Spearman’s ρ = −0.36), consistent with a “Reasoning Contamination Effect.”
- The authors conclude that minimal verbal elicitation may fail to preserve internal uncertainty signals at the output interface at this model scale, and that psychometric screening should precede any downstream reliance on such signals.
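The two statistics at the core of the screen can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper’s code): the confidence and correctness values below are made up, and “Type-2 discrimination” is approximated here as the probability that a correct trial receives a higher verbalized confidence than an incorrect one (an AUROC-style measure, ties counted as 0.5).

```python
def ceiling_rate(confidences, ceiling=100):
    """Fraction of trials reporting the maximum possible confidence."""
    return sum(c == ceiling for c in confidences) / len(confidences)

def type2_auroc(confidences, correct):
    """AUROC-style Type-2 discrimination: probability that a correct trial
    carries a higher confidence than an incorrect one (ties count 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")  # undefined without both outcome classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical trials: confidence pinned at the ceiling carries no signal,
# so discrimination sits at or below chance (0.5).
conf = [100, 100, 100, 100, 100, 90, 100, 100]
acc  = [1,   0,   1,   1,   0,   1,  0,   1]
print(ceiling_rate(conf))        # 0.875 — most trials at the ceiling
print(type2_auroc(conf, acc))    # 0.4 — no useful discrimination
```

The intuition matches the paper’s finding: when ~92% of responses sit at the ceiling, almost every correct/incorrect pair is a tie, forcing discrimination toward chance regardless of the model’s internal uncertainty.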
Related Articles

Subagents: The Building Block of Agentic AI
Dev.to

GET Serves Cache, POST Runs Inference: Cost Safety for a Public LLM Endpoint
Dev.to

DeepSeek-V4 Models Could Change Global AI Race
AI Business

Got OpenAI's privacy filter model running on-device via ExecuTorch
Reddit r/LocalLLaMA

The Agent-Skill Illusion: Why Prompt-Based Control Fails in Multi-Agent Business Consulting Systems
Dev.to