Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality

arXiv cs.CL / 5/5/2026


Key Points

  • The paper addresses hallucinations in long-form generation by targeting how reasoning and final answers are connected, where errors can compound over many steps.
  • It introduces an “Exploration-Commitment Decoupling” approach that separates knowledge exploration from final commitment, allowing more cautious and controlled answering.
  • The proposed Calibration-Aware Generation (CAG) framework adds calibrated reliability estimates to intermediate reasoning and uses them to prioritize reliable content in the final output.
  • Experiments across five long-form factuality benchmarks and multiple model families show up to 13% improvement in factuality and up to 37% reduction in decoding time.
  • The work argues that decoupling exploration from commitment is a principled direction toward more trustworthy, self-aware generative systems.

Abstract

Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches to improving factuality, including abstention and factuality-driven optimization, follow a *coupled exploration-commitment* paradigm, in which intermediate reasoning is unconditionally propagated to the final output, limiting fine-grained control over information selection and integration. In this paper, we propose an **Exploration-Commitment Decoupling** paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with **Calibration-Aware Generation (CAG)**, a framework that equips models with end-to-end, calibration-aware generation capabilities by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs. Across five long-form factuality benchmarks and multiple model families, CAG improves factuality by up to 13%, while reducing decoding time by up to 37%. Overall, our work highlights decoupling as a principled approach for more reliable long-form generation, offering directions for trustworthy and self-aware generative systems.
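The abstract does not spell out the mechanics of CAG, but the core decoupling idea can be sketched: an exploration phase produces candidate claims with raw confidence scores, a calibration step maps those scores to reliability estimates, and a commitment phase keeps only claims whose calibrated reliability clears a threshold. The following is a minimal illustrative sketch; the `Claim` type, the toy temperature-style calibration map, and the threshold are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    raw_confidence: float  # model's uncalibrated self-reported confidence

def calibrate(raw: float, temperature: float = 2.0) -> float:
    """Toy calibration: shrink overconfident raw scores toward 0.5."""
    return 0.5 + (raw - 0.5) / temperature

def commit(claims: list[Claim], threshold: float = 0.7) -> list[str]:
    """Commitment phase: emit only claims whose calibrated reliability
    clears the threshold; everything else is withheld."""
    return [c.text for c in claims if calibrate(c.raw_confidence) >= threshold]

# Exploration phase output (hypothetical): candidate claims with raw scores.
explored = [
    Claim("Paris is the capital of France.", 0.99),
    Claim("The Eiffel Tower was completed in 1889.", 0.95),
    Claim("The tower is 450 meters tall.", 0.60),  # shaky claim, filtered out
]
print(commit(explored))
# → ['Paris is the capital of France.', 'The Eiffel Tower was completed in 1889.']
```

Separating `calibrate` from `commit` mirrors the decoupling the paper argues for: the model can explore freely, while the final answer only commits to content it can reliably stand behind.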
