Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

arXiv cs.AI / 4/27/2026


Key Points

  • The paper evaluates whether seven 3–9B open-weight, instruction-tuned LLMs generate verbalized confidence that satisfies minimal psychometric validity criteria for item-level Type-2 discrimination under low-effort elicitation (greedy decoding, minimal numeric querying).
  • A pre-registered study using 524 TriviaQA items (8,384 deterministic trials) found all seven instruct models were classified "Invalid" for numeric confidence, with a mean ceiling rate of ~91.7%, indicating that confidence outputs did not meaningfully discriminate between items (a minimal illustration of such a screen appears after this list).
  • Switching to categorical (10-class) confidence elicitation did not restore validity; instead, it degraded task accuracy to below 5% in six of seven models.
  • Token-level log-probabilities did not predict verbalized confidence (cross-validated R^2 < 0.01), and in one reasoning-distilled model longer reasoning traces showed a strong negative partial correlation with confidence (rho = -0.36), consistent with a "Reasoning Contamination Effect."
  • The authors conclude that minimal verbal elicitation may fail to preserve internal uncertainty signals at the output interface for this model size, and that psychometric screening should be done before relying on such signals downstream.
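
The paper's exact pre-registered criteria are not reproduced here, but the two headline quantities, ceiling rate and item-level Type-2 discrimination, can be illustrated from per-trial records. The Python sketch below is a minimal, assumed reconstruction: the 90-point ceiling band, the AUROC-style discrimination statistic, the cutoffs in `screen`, and all function names are illustrative placeholders, not the authors' thresholds.

```python
"""Illustrative validity screen: ceiling rate and Type-2 discrimination.

Assumes per-trial records of (verbalized confidence 0-100, answer correctness).
The thresholds below are placeholders, not the paper's pre-registered criteria.
"""
from dataclasses import dataclass


@dataclass
class Trial:
    confidence: float  # verbalized confidence on a 0-100 scale
    correct: bool      # whether the answer matched the TriviaQA key


def ceiling_rate(trials: list[Trial], ceiling: float = 90.0) -> float:
    """Fraction of trials whose confidence sits at or above the ceiling band."""
    return sum(t.confidence >= ceiling for t in trials) / len(trials)


def type2_auroc(trials: list[Trial]) -> float:
    """Probability that a correct trial carries higher confidence than an
    incorrect one (ties count as 0.5); 0.5 means no Type-2 discrimination."""
    correct = [t.confidence for t in trials if t.correct]
    wrong = [t.confidence for t in trials if not t.correct]
    if not correct or not wrong:
        return float("nan")  # degenerate cell: accuracy is 0% or 100%
    wins = sum((c > w) + 0.5 * (c == w) for c in correct for w in wrong)
    return wins / (len(correct) * len(wrong))


def screen(trials: list[Trial]) -> str:
    """Toy classification of one model-format cell; cutoffs are assumptions."""
    if ceiling_rate(trials) > 0.9 or type2_auroc(trials) <= 0.5:
        return "Invalid"
    return "Valid"
```

A cell dominated by near-maximal confidence values will show both a ceiling rate above 0.9 and an AUROC near 0.5, which is the saturation pattern the paper reports.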

Abstract

Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level log-probability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
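
For readers who want to replicate the log-probability finding (H5) and the trace-length association on their own transcripts, the sketch below shows one plausible way to compute them with scikit-learn and SciPy. It is a simplification under stated assumptions: a plain linear regression with 5-fold cross-validation stands in for the paper's predictive model, and an unadjusted Spearman rho stands in for the reported partial correlation, whose control variables are not reproduced here.

```python
"""Illustrative checks for H5 and the trace-length/confidence association.

Assumes per-trial arrays; the plain linear regression, 5-fold CV, and
unadjusted Spearman rho are simplifications of the paper's analysis.
"""
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def logprob_predicts_confidence(mean_logprob: np.ndarray,
                                confidence: np.ndarray) -> float:
    """Mean 5-fold cross-validated R^2 of verbalized confidence regressed on
    mean token log-probability; values near zero mirror H5."""
    X = mean_logprob.reshape(-1, 1)
    scores = cross_val_score(LinearRegression(), X, confidence,
                             cv=5, scoring="r2")
    return float(scores.mean())


def trace_length_vs_confidence(trace_tokens: np.ndarray,
                               confidence: np.ndarray) -> tuple[float, float]:
    """Spearman rho and p-value between reasoning-trace length and verbalized
    confidence; the paper reports a *partial* correlation, which would
    additionally control for covariates such as answer correctness."""
    rho, p = spearmanr(trace_tokens, confidence)
    return float(rho), float(p)
```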