Hallucination as output-boundary misclassification: a composite abstention architecture for language models

arXiv cs.CL / 4/9/2026


Key Points

  • The paper argues that hallucination can be understood as output-boundary misclassification: internally generated text is emitted as if it were grounded, without sufficient support in evidence.
  • It proposes a composite abstention architecture that combines instruction-based refusal with a structural abstention gate using a support-deficit score St derived from self-consistency, paraphrase stability, and citation coverage.
  • In evaluations across 50 items, five epistemic regimes, and three models, neither instruction-only prompting nor the structural gate alone fully resolved hallucination, with tradeoffs like over-abstention and residual hallucination.
  • The composite approach improves overall accuracy while lowering hallucinations, but it also inherits some over-abstention behavior from the instruction component and can miss confident confabulations in specific conflicting-evidence settings.
  • A 100-item no-context stress test based on TruthfulQA suggests the structural gate offers a capability-independent abstention floor, supporting the case for combining both mechanisms.

Abstract

Large language models often produce unsupported claims. We frame this as a misclassification error at the output boundary, where internally generated completions are emitted as if they were grounded in evidence. This motivates a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate computes a support-deficit score, St, from three black-box signals — self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct) — and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models, neither mechanism alone was sufficient. Instruction-only prompting reduced hallucination sharply, but still showed over-cautious abstention on answerable items and residual hallucination for GPT-3.5-turbo. The structural gate preserved answerable accuracy across models but missed confident confabulation on conflicting-evidence items. The composite architecture achieved high overall accuracy with low hallucination, while also inheriting some over-abstention from the instruction component. A supplementary 100-item no-context stress test derived from TruthfulQA showed that structural gating provides a capability-independent abstention floor. Overall, instruction-based refusal and structural gating show complementary failure modes, which suggests that effective hallucination control benefits from combining both mechanisms.
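To make the gate concrete, here is a minimal sketch of how such a structural abstention gate could look. The paper's exact formula for combining the three signals is not given in this summary, so the weighted-average combination, the normalization of each signal to [0, 1], and the threshold value below are all illustrative assumptions, not the authors' method.

```python
def support_deficit(a_t: float, p_t: float, c_t: float,
                    weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine three black-box support signals into a support-deficit score St.

    a_t: self-consistency, p_t: paraphrase stability, c_t: citation coverage.
    Each signal is assumed to be normalized to [0, 1], where 1 = fully
    supported. The equal-weight average of the deficits (1 - signal) is an
    assumption; the paper may combine them differently.
    """
    signals = (a_t, p_t, c_t)
    return sum(w * (1.0 - s) for w, s in zip(weights, signals))


def should_abstain(a_t: float, p_t: float, c_t: float,
                   threshold: float = 0.5) -> bool:
    """Block the output (abstain) when St exceeds the threshold."""
    return support_deficit(a_t, p_t, c_t) > threshold


# A well-supported completion passes the gate; a weakly supported one abstains.
print(should_abstain(0.9, 0.9, 0.8))  # False: low deficit, emit the answer
print(should_abstain(0.2, 0.3, 0.1))  # True: high deficit, abstain
```

Because all three signals are black-box (they need only sampled outputs, paraphrased re-queries, and citation checks), a gate like this can sit in front of any model without access to logits or weights, which is consistent with the "capability-independent abstention floor" observed in the no-context stress test.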