I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation

arXiv cs.CL / 4/7/2026


Key Points

  • LLM hallucination is framed as the problem of confidently wrong answers; one cause is that common binary scoring conventions reward answering over honestly expressing uncertainty (i.e., abstaining).
  • The paper proposes I-CALM, a prompt-only intervention that requires no model modification: it explicitly announces reward schemes for the answer-versus-abstain decision and adds norms promoting truthfulness, humility, and responsibility.
  • In the setting of epistemic abstention on factual questions with verifiable answers, self-reported verbal confidence is used as the uncertainty signal, and it is shown to be reasonably robust to prompt paraphrasing and reasonably calibrated against a token-probability baseline.
  • With GPT-5 mini on PopQA, the combination of confidence elicitation, abstention rewards, and norms reduces the false-answer rate mainly by shifting error-prone cases to abstention, yielding a coverage-versus-reliability trade-off.
  • Varying the abstention reward traces an abstention-hallucination frontier, showing that selective answering can be improved without any training; code is publicly available.
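The prompt-only intervention in these points can be sketched as a template that announces the reward scheme, elicits verbal confidence, and appends the normative principles. The wording below is illustrative (`build_icalm_prompt` and its phrasing are my assumptions, not the paper's actual template):

```python
def build_icalm_prompt(question: str, abstain_reward: float) -> str:
    """Illustrative I-CALM-style prompt (hypothetical wording):
    (i) announce an explicit reward scheme for answer vs. abstain,
    (ii) elicit verbal confidence, (iii) add humility-oriented norms."""
    return (
        "Scoring: a correct answer earns 1 point, a wrong answer earns 0 points, "
        f"and abstaining with 'I don't know' earns {abstain_reward} points.\n"
        "Principles: be truthful, acknowledge uncertainty honestly, and take "
        "responsibility for what you assert.\n"
        "First, state your confidence (0-100%) that you can answer correctly. "
        "If your confidence is low, abstain by answering 'I don't know'.\n"
        f"Question: {question}"
    )

prompt = build_icalm_prompt("Who wrote 'The Master and Margarita'?", 0.5)
```

Sweeping `abstain_reward` is what produces the abstention-hallucination frontier reported in the paper.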

Abstract

Large language models (LLMs) frequently produce confident but incorrect answers, partly because common binary scoring conventions reward answering over honestly expressing uncertainty. We study whether prompt-only interventions -- explicitly announcing reward schemes for answer-versus-abstain decisions plus humility-oriented normative principles -- can reduce hallucination risk without modifying the model. Our focus is epistemic abstention on factual questions with a verifiable answer, where current LLMs often fail to abstain despite being uncertain about their answers. We first assess self-reported verbal confidence as a usable uncertainty signal, showing stability under prompt paraphrasing and reasonable calibration against a token-probability baseline. We then study I-CALM, a prompt-based framework that (i) elicits verbal confidence, (ii) partially rewards abstention through explicit reward schemes, and (iii) adds lightweight normative principles emphasizing truthfulness, humility, and responsibility. Using GPT-5 mini on PopQA as the main setting, we find that confidence-eliciting, abstention-rewarding prompts, especially with norms, reduce the false-answer rate on answered cases mainly by identifying and shifting error-prone cases to abstention and re-calibrating their confidence. This trades coverage for reliability while leaving forced-answer performance largely unchanged. Varying the abstention reward yields a clear abstention-hallucination frontier. Overall, results show the framework can improve selective answering on factual questions without retraining, with the magnitude of effect varying across models and datasets. Code is available at https://github.com/binzeli/hallucinationControl.
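The reward scheme in the abstract implies a simple rational policy: if a correct answer earns 1, a wrong answer earns 0, and abstaining earns a fixed reward r, then an expected-score maximizer should abstain exactly when its confidence p falls below r. A minimal sketch under that 1/0/r assumption (function names are mine, not the paper's):

```python
def expected_score(confidence: float, abstain_reward: float, answer: bool) -> float:
    """Expected score under the 1/0/r scheme: answering yields the
    probability of being correct; abstaining yields the fixed reward."""
    return confidence if answer else abstain_reward

def should_abstain(confidence: float, abstain_reward: float) -> bool:
    """Abstain when the fixed abstention reward beats the expected score
    of answering, i.e. when confidence < abstain_reward. Raising the
    reward shifts more low-confidence (error-prone) cases to abstention,
    trading coverage for reliability."""
    return expected_score(confidence, abstain_reward, answer=True) < abstain_reward

# A low-confidence case abstains; a high-confidence case answers.
assert should_abstain(0.3, 0.5)
assert not should_abstain(0.9, 0.5)
```

Sweeping `abstain_reward` from 0 to 1 moves the abstention threshold and traces out the coverage/reliability frontier the paper reports.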