Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B

arXiv cs.CL · April 28, 2026

📰 News · Models & Research

Key Points

  • The paper examines how small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation, with ceiling rates above 95% and near-chance Type-2 AUROC, and tests confidence-conditioned supervised fine-tuning (CSFT) as a way to align the model's internal information with its verbal confidence readout.
  • A pre-registered Phase 0 experiment on Gemma 3 4B-it used a modal filter (training only on items whose modal answer is correct) and produced a negative result: AUROC2 dropped from 0.554 to 0.509, driven by label-entropy collapse in the generated targets.
  • An exploratory post-hoc “rescue” removed the modal filter and trained on all 2,000 calibration items, producing a strong binary verbal correctness discriminator (AUROC2 = 0.774 on held-out TriviaQA) and compressing a 10-sample self-consistency signal into a single-pass readout.
  • Controls and ablations support a target-dependent interpretation: a shuffled-target control showed no improvement (AUROC2 = 0.501), while MMLU accuracy rose substantially, from 54.2% to 77.4%, with the shuffled model near baseline (56.1%).
  • The authors conclude the findings are exploratory and limited to a single model scale, but they derive two key design lessons: confidence training needs sufficient label entropy, and correct targets help regularize output formatting.
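The metric at the center of these results, Type-2 AUROC (AUROC2), scores how well a model's stated confidence separates its own correct answers from its incorrect ones. A minimal, dependency-free sketch (our illustration; the function name and data are hypothetical, not from the paper) shows why uniformly ceiling-level confidence is degenerate:

```python
# Hypothetical illustration of Type-2 AUROC ("AUROC2"): how well a model's
# stated confidence separates its own correct answers from its incorrect ones.

def auroc(scores, labels):
    """Rank-based AUROC: P(score_correct > score_incorrect), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Degenerate ceiling behaviour: every item gets the same verbal confidence,
# so confidence carries no rank information about correctness -> AUROC2 = 0.5.
print(auroc([95] * 8, [1, 1, 0, 1, 0, 1, 0, 1]))  # 0.5

# Informative confidence: higher scores on correct items -> AUROC2 = 1.0.
print(auroc([90, 80, 70, 60], [1, 1, 0, 0]))      # 1.0
```

An AUROC2 near 0.5 (chance) means confidence is useless for flagging the model's own errors; the paper's rescue result of 0.774 sits between this floor and the 10-sample self-consistency ceiling of 0.999.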

Abstract

Small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near-chance Type-2 AUROC, and Invalid validity profiles. We test whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets can close the gap between internal information and verbal readout. A pre-registered Phase 0 protocol on Gemma 3 4B-it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label-entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held-out TriviaQA, compressing a 10-sample self-consistency signal (AUROC2 = 0.999) into a single-pass readout exceeding logit entropy (0.701). The shuffled-target control showed no improvement (0.501). On MMLU, accuracy improved from 54.2% to 77.4% with the shuffled model at baseline (56.1%), supporting a target-dependent interpretation. The result is exploratory, binary rather than continuously calibrated, and observed at a single scale. It identifies two design lessons: confidence training requires label entropy, and correct targets regularise output format.
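The failure mode behind the pre-registered negative result can be sketched concretely. The following is our reconstruction, not the paper's code, with invented data: self-consistency targets are the fraction of sampled answers agreeing with the modal answer, and the modal filter keeps only items whose modal answer is correct:

```python
# Sketch (our reconstruction, not the paper's code) of self-consistency-derived
# confidence targets and the pre-registered modal filter. Data are invented.
from collections import Counter

def consistency_target(samples):
    """Modal answer and the fraction of samples agreeing with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

# (sampled answers, gold answer) per calibration item -- hypothetical values.
items = [
    (["Paris"] * 10,              "Paris"),  # high agreement, modal correct
    (["Oslo"] * 9 + ["Bergen"],   "Oslo"),   # high agreement, modal correct
    (["1912"] * 6 + ["1905"] * 4, "1905"),   # low agreement, modal wrong
    (["red"] * 5 + ["blue"] * 5,  "blue"),   # low agreement, modal wrong
]

all_targets, filtered_targets = [], []
for samples, gold in items:
    modal, p = consistency_target(samples)
    all_targets.append(p)
    if modal == gold:  # modal filter: keep only items with a correct modal answer
        filtered_targets.append(p)

print(all_targets)       # [1.0, 0.9, 0.6, 0.5] -- spread of confidence labels
print(filtered_targets)  # [1.0, 0.9] -- filter keeps only near-ceiling targets
```

Because items with a correct modal answer tend to be exactly the high-agreement ones, the filter strips out the low-confidence labels, collapsing label entropy; training then teaches the model to say "high confidence" everywhere, consistent with AUROC2 falling toward chance. Removing the filter, as in the rescue, retains the low-agreement items and the spread of targets.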