Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B

arXiv cs.CL · April 28, 2026

📰 News · Models & Research

Key Points

  • The paper examines how small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation, with ceiling rates above 95% and near-chance Type-2 AUROC, and tests confidence-conditioned supervised fine-tuning (CSFT) as a way to align the model's internal information with its verbal confidence readout.
  • A pre-registered Phase 0 experiment on Gemma 3 4B-it used a modal filter (training only on items whose modal answer is correct) and produced a negative result: AUROC2 dropped from 0.554 to 0.509, driven by label-entropy collapse in the generated targets.
  • An exploratory post-hoc “rescue” removed the modal filter and trained on all 2,000 calibration items, producing a strong binary verbal correctness discriminator (AUROC2 = 0.774 on held-out TriviaQA) and compressing a 10-sample self-consistency signal into a single-pass readout.
  • Controls and ablations support a target-dependent interpretation: a shuffled-target control showed no improvement (AUROC2 = 0.501), while MMLU accuracy rose substantially, from 54.2% to 77.4%, with the shuffled model near baseline (56.1%).
  • The authors conclude the findings are exploratory and limited to a single model scale, but they derive two key design lessons: confidence training needs sufficient label entropy, and correct targets help regularize output formatting.
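The metric at the center of these results, Type-2 AUROC (AUROC2), scores how well a model's stated confidence separates its own correct answers from its incorrect ones. A minimal, dependency-free sketch (our illustration; the function name and data are hypothetical, not from the paper) shows why uniformly ceiling-level confidence is degenerate:

```python
# Hypothetical illustration of Type-2 AUROC ("AUROC2"): how well a model's
# stated confidence separates its own correct answers from its incorrect ones.

def auroc(scores, labels):
    """Rank-based AUROC: P(score_correct > score_incorrect), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Degenerate ceiling behaviour: every item gets the same verbal confidence,
# so confidence carries no rank information about correctness -> AUROC2 = 0.5.
print(auroc([95] * 8, [1, 1, 0, 1, 0, 1, 0, 1]))  # 0.5

# Informative confidence: higher scores on correct items -> AUROC2 = 1.0.
print(auroc([90, 80, 70, 60], [1, 1, 0, 0]))      # 1.0
```

An AUROC2 near 0.5 (chance) means confidence is useless for flagging the model's own errors; the paper's rescue result of 0.774 sits between this floor and the 10-sample self-consistency ceiling of 0.999.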

Abstract

Small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near-chance Type-2 AUROC, and Invalid validity profiles. We test whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets can close the gap between internal information and verbal readout. A pre-registered Phase 0 protocol on Gemma 3 4B-it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label-entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held-out TriviaQA, compressing a 10-sample self-consistency signal (AUROC2 = 0.999) into a single-pass readout exceeding logit entropy (0.701). The shuffled-target control showed no improvement (0.501). On MMLU, accuracy improved from 54.2% to 77.4% with the shuffled model at baseline (56.1%), supporting a target-dependent interpretation. The result is exploratory, binary rather than continuously calibrated, and observed at a single scale. It identifies two design lessons: confidence training requires label entropy, and correct targets regularise output format.
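The failure mode behind the pre-registered negative result can be sketched concretely. The following is our reconstruction, not the paper's code, with invented data: self-consistency targets are the fraction of sampled answers agreeing with the modal answer, and the modal filter keeps only items whose modal answer is correct:

```python
# Sketch (our reconstruction, not the paper's code) of self-consistency-derived
# confidence targets and the pre-registered modal filter. Data are invented.
from collections import Counter

def consistency_target(samples):
    """Modal answer and the fraction of samples agreeing with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

# (sampled answers, gold answer) per calibration item -- hypothetical values.
items = [
    (["Paris"] * 10,              "Paris"),  # high agreement, modal correct
    (["Oslo"] * 9 + ["Bergen"],   "Oslo"),   # high agreement, modal correct
    (["1912"] * 6 + ["1905"] * 4, "1905"),   # low agreement, modal wrong
    (["red"] * 5 + ["blue"] * 5,  "blue"),   # low agreement, modal wrong
]

all_targets, filtered_targets = [], []
for samples, gold in items:
    modal, p = consistency_target(samples)
    all_targets.append(p)
    if modal == gold:  # modal filter: keep only items with a correct modal answer
        filtered_targets.append(p)

print(all_targets)       # [1.0, 0.9, 0.6, 0.5] -- spread of confidence labels
print(filtered_targets)  # [1.0, 0.9] -- filter keeps only near-ceiling targets
```

Because items with a correct modal answer tend to be exactly the high-agreement ones, the filter strips out the low-confidence labels, collapsing label entropy; training then teaches the model to say "high confidence" everywhere, consistent with AUROC2 falling toward chance. Removing the filter, as in the rescue, retains the low-agreement items and the spread of targets.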