Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B
arXiv cs.CL · April 28, 2026
Key Points
- The paper examines how small instruct-tuned LLMs can produce degenerate verbal confidence (near-ceiling confidence rates and poor Type-2 AUROC) and tests confidence-conditioned supervised fine-tuning (CSFT) as a way to align the model's internal correctness information with its verbal readout.
- A pre-registered Phase 0 experiment on Gemma 3 4B-it using a modal filter (training only on items with correct modal answers) resulted in a negative outcome, with AUROC2 decreasing due to label-entropy collapse in the generated targets.
- An exploratory post-hoc “rescue” that removed the modal filter and trained on all 2,000 calibration items produced a strong binary verbal correctness discriminator (AUROC2 = 0.774 on held-out TriviaQA), compressing multi-sample self-consistency signals into a single-pass readout.
- Controls and ablations showed no improvement with shuffled targets, while MMLU accuracy increased substantially, indicating that the gains depend on the quality and structure of the training targets.
- The authors conclude the findings are exploratory and limited to a single model scale, but they derive two key design lessons: confidence training needs sufficient label entropy, and correct targets help regularize output formatting.
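The two quantities the summary leans on can be made concrete. A minimal sketch, assuming the usual definitions (the paper's exact implementation is not given here): self-consistency confidence is the fraction of sampled answers that agree with the modal (most frequent) answer, and Type-2 AUROC (AUROC2) is the probability that a correctly answered item receives higher confidence than an incorrectly answered one. All function names below are illustrative.

```python
from collections import Counter

def self_consistency_confidence(samples):
    """Return (modal answer, fraction of samples agreeing with it)."""
    modal_answer, modal_count = Counter(samples).most_common(1)[0]
    return modal_answer, modal_count / len(samples)

def auroc2(confidences, correct):
    """Type-2 AUROC: P(conf on a correct item > conf on an incorrect
    item), with ties counted as 0.5 (rank-sum formulation)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")  # undefined with only one outcome class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy usage: 3 of 4 samples agree, so confidence is 0.75.
modal, conf = self_consistency_confidence(["Paris", "Paris", "Lyon", "Paris"])
```

Under this view, the modal filter in Phase 0 keeps only items where the modal answer is correct, which skews the generated confidence labels toward the high end; that is the "label-entropy collapse" the negative result attributes the AUROC2 drop to, and removing the filter restores the low-confidence labels the discriminator needs.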
Related Articles

- An improvement of the convergence proof of the ADAM-Optimizer (Dev.to)
- We built an AI that runs an entire business autonomously. Not a demo. Not a prototype. Actually running. YC-backed, here's what we learned. (Reddit r/artificial)
- langchain-tests==1.1.7 (LangChain Releases)
- Why isn’t LLM reasoning done in vector space instead of natural language? (Reddit r/LocalLLaMA)
- llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged (Reddit r/LocalLLaMA)