Confidence Should Be Calibrated More Than One Turn Deep

arXiv cs.CL / 4/8/2026


Key Points

  • The paper argues that LLM confidence calibration must be treated as a dynamic, conversation-history-dependent problem rather than a static single-turn property for high-stakes multi-turn use cases.
  • It introduces a multi-turn calibration task and a new metric, ECE@T, to measure how calibration changes across turns, showing that user feedback can worsen multi-turn calibration.
  • To improve calibration, the authors propose MTCal, which minimizes ECE@T using a surrogate calibration target conditioned on prior dialogue.
  • They also present ConfChat, a decoding strategy that uses calibrated confidence to improve response factuality and consistency in multi-turn interactions.
  • Experiments report that MTCal yields strong, consistent performance for multi-turn calibration and that ConfChat maintains or improves overall multi-turn model quality.
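The summary does not spell out how ECE@T is computed, but it extends the standard Expected Calibration Error to a per-turn view. A minimal sketch, assuming the conventional equal-width binning of ECE and assuming each dialogue is recorded as a list of (confidence, is_correct) pairs per turn (the function names and data layout here are illustrative, not the paper's):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Standard Expected Calibration Error over equal-width confidence bins:
    weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        err += (mask.sum() / total) * abs(
            confidences[mask].mean() - correct[mask].mean()
        )
    return err

def ece_at_turn(dialogues, t, n_bins=10):
    """ECE@T sketch: calibration error restricted to turn-t predictions.
    `dialogues` is a list of dialogues, each a list of
    (confidence, is_correct) tuples, one per turn (assumed format)."""
    confs = [d[t][0] for d in dialogues if len(d) > t]
    corr = [d[t][1] for d in dialogues if len(d) > t]
    return ece(confs, corr, n_bins)
```

Plotting `ece_at_turn(dialogues, t)` for t = 0, 1, 2, … is what lets one see calibration drift across a conversation, e.g. after user pushback.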

Abstract

Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimizes ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of model responses in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.
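The abstract describes ConfChat only at a high level: a decoding strategy that uses calibrated confidence to improve factuality and consistency. One plausible (and purely hypothetical) reading is confidence-guided response selection, where the model proposes candidate replies and the calibrated, history-conditioned confidence decides which to emit, falling back to an explicit uncertainty statement when no candidate is confident enough. Everything below (`confchat_decode`, `calibrated_conf`, the threshold) is an illustrative assumption, not the paper's actual algorithm:

```python
def confchat_decode(candidates, calibrated_conf, history, threshold=0.5):
    """Hypothetical confidence-guided decoding sketch.

    candidates:      candidate responses sampled from the model (strings)
    calibrated_conf: callable (history, candidate) -> calibrated
                     probability the candidate is correct, conditioned
                     on the dialogue history (assumed interface)
    threshold:       minimum confidence required to commit to an answer
    """
    # Score every candidate under the calibrated confidence model.
    scored = [(calibrated_conf(history, c), c) for c in candidates]
    conf, best = max(scored, key=lambda pair: pair[0])
    # If even the best candidate is below threshold, surface the
    # uncertainty instead of asserting a likely-wrong answer.
    if conf < threshold:
        return "I'm not certain about this.", conf
    return best, conf
```

The design intuition such a scheme would capture: a well-calibrated confidence signal lets the decoder resist unwarranted flip-flopping under user persuasion (consistency) while still abstaining when the model genuinely does not know (factuality).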