Hallucination as output-boundary misclassification: a composite abstention architecture for language models

arXiv cs.CL / 4/9/2026


Key Points

  • The paper argues that hallucination can be understood as output-boundary misclassification: internally generated text is emitted as if it were grounded, without sufficient support in evidence.
  • It proposes a composite abstention architecture that combines instruction-based refusal with a structural abstention gate using a support-deficit score St derived from self-consistency, paraphrase stability, and citation coverage.
  • In evaluations across 50 items, five epistemic regimes, and three models, neither instruction-only prompting nor the structural gate alone fully resolved hallucination, with tradeoffs like over-abstention and residual hallucination.
  • The composite approach improves overall accuracy while lowering hallucinations, but it also inherits some over-abstention behavior from the instruction component and can miss confident confabulations in specific conflicting-evidence settings.
  • A 100-item no-context stress test based on TruthfulQA suggests the structural gate offers a capability-independent abstention floor, supporting the case for combining both mechanisms.

Abstract

Large language models often produce unsupported claims. We frame this as a misclassification error at the output boundary, where internally generated completions are emitted as if they were grounded in evidence. This motivates a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate computes a support-deficit score, St, from three black-box signals — self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct) — and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models, neither mechanism alone was sufficient. Instruction-only prompting reduced hallucination sharply, but still showed over-cautious abstention on answerable items and residual hallucination for GPT-3.5-turbo. The structural gate preserved answerable accuracy across models but missed confident confabulation on conflicting-evidence items. The composite architecture achieved high overall accuracy with low hallucination, while also inheriting some over-abstention from the instruction component. A supplementary 100-item no-context stress test derived from TruthfulQA showed that structural gating provides a capability-independent abstention floor. Overall, instruction-based refusal and structural gating show complementary failure modes, which suggests that effective hallucination control benefits from combining both mechanisms.
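To make the gate concrete, here is a minimal sketch of how such a structural abstention gate could look. The paper's exact formula for combining the three signals is not given in this summary, so the weighted-average combination, the normalization of each signal to [0, 1], and the threshold value below are all illustrative assumptions, not the authors' method.

```python
def support_deficit(a_t: float, p_t: float, c_t: float,
                    weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine three black-box support signals into a support-deficit score St.

    a_t: self-consistency, p_t: paraphrase stability, c_t: citation coverage.
    Each signal is assumed to be normalized to [0, 1], where 1 = fully
    supported. The equal-weight average of the deficits (1 - signal) is an
    assumption; the paper may combine them differently.
    """
    signals = (a_t, p_t, c_t)
    return sum(w * (1.0 - s) for w, s in zip(weights, signals))


def should_abstain(a_t: float, p_t: float, c_t: float,
                   threshold: float = 0.5) -> bool:
    """Block the output (abstain) when St exceeds the threshold."""
    return support_deficit(a_t, p_t, c_t) > threshold


# A well-supported completion passes the gate; a weakly supported one abstains.
print(should_abstain(0.9, 0.9, 0.8))  # False: low deficit, emit the answer
print(should_abstain(0.2, 0.3, 0.1))  # True: high deficit, abstain
```

Because all three signals are black-box (they need only sampled outputs, paraphrased re-queries, and citation checks), a gate like this can sit in front of any model without access to logits or weights, which is consistent with the "capability-independent abstention floor" observed in the no-context stress test.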