Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

arXiv cs.LG / 4/22/2026

📰 News · Models & Research

Key Points

  • The paper addresses a key reliability gap in reasoning LLMs: they often fail to output confidence scores that are properly calibrated for trustworthy real-world deployment.
  • It proposes an unsupervised confidence calibration method that works with only a single generation at inference time, avoiding the need for labeled data or repeated sampling.
  • The method performs offline sampling on unlabeled data to create a self-consistency-based proxy target, then distills that into a lightweight confidence predictor for deployment.
  • Experiments across 5 math/QA tasks with 9 reasoning models show substantial improvements over baselines, including robustness under distribution shift.
  • The calibrated confidence boosts downstream use cases such as selective prediction and simulated decision-making pipelines.

Abstract

Reasoning language models can solve increasingly complex tasks, but struggle to produce the calibrated confidence estimates necessary for reliable deployment. Existing calibration methods usually depend on labels or repeated sampling at inference time, making them impractical in many settings. We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated decision-making.
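The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: it assumes the proxy target for each offline sample is its agreement rate with the other samples (a standard self-consistency score), and it stands in for the paper's "lightweight confidence predictor" with a hypothetical one-feature logistic head trained by squared-error distillation.

```python
import math
from collections import Counter

def self_consistency_targets(answers):
    """Stage 1 (offline, unlabeled): for K sampled answers to one prompt,
    the proxy confidence of each answer is the fraction of the K samples
    that agree with it."""
    counts = Counter(answers)
    k = len(answers)
    return [counts[a] / k for a in answers]

def distill_confidence_head(features, targets, lr=0.5, steps=3000):
    """Stage 2 (distillation): fit sigmoid(w*x + b) to the proxy targets
    by squared-error gradient descent. `features` is a hypothetical scalar
    per generation (e.g. mean token log-prob); the real predictor could
    use richer inputs."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, t in zip(features, targets):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad = (p - t) * p * (1.0 - p)  # d(0.5*(p-t)^2)/d(logit)
            w -= lr * grad * x
            b -= lr * grad
    return w, b

def predict_confidence(w, b, x):
    """Deployment: score a single generation, no repeated sampling."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Hypothetical offline samples for one unlabeled prompt
samples = ["42", "42", "41", "42", "7"]
targets = self_consistency_targets(samples)  # [0.6, 0.6, 0.2, 0.6, 0.2]

# Hypothetical per-sample feature: mean token log-prob of each generation
feats = [-0.2, -0.3, -1.5, -0.25, -2.0]
w, b = distill_confidence_head(feats, targets)
```

At inference the distilled head scores one generation directly, which is what removes the need for labels or repeated sampling at deployment time.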