Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

arXiv cs.AI / April 22, 2026

💬 Opinion · Models & Research

Key Points

  • The study analyzes hallucinated citations in 9 LLMs using 108,000 generated references and finds that author-name fields fail more often than other citation fields across models and settings.
  • Citation formatting/style does not significantly change citation accuracy, while reasoning-focused distillation can reduce recall of correct citation elements.
  • Field-level hallucination signals are largely non-transferable: probes trained on one citation field only transfer to others at near-chance performance.
  • By applying elastic-net regularization with stability selection to neuron-level CETT values in Qwen2.5-32B-Instruct, the researchers identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal interventions confirm their role: boosting these neurons increases hallucination, while suppressing them improves citation performance across fields.
  • The work proposes a lightweight detection/mitigation strategy for citation hallucination based on internal neuron signals rather than external supervision.
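The elastic-net-with-stability-selection step described above can be sketched as follows. Everything here is a hypothetical stand-in: the synthetic feature matrix imitates per-neuron CETT values, and the subsample fraction, regularization strength, and selection threshold are illustrative choices, not the paper's actual hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: rows = generated references, columns = per-neuron
# CETT values; y = 1 if the citation field was hallucinated.
# Only the first 5 "neurons" actually carry signal in this toy setup.
n_samples, n_neurons = 600, 200
X = rng.normal(size=(n_samples, n_neurons))
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

def stability_selection(X, y, n_runs=30, subsample=0.5, threshold=0.6):
    """For each neuron, count how often its elastic-net weight is nonzero
    across refits on random subsamples; keep neurons above the threshold."""
    counts = np.zeros(X.shape[1])
    for _ in range(n_runs):
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        clf = LogisticRegression(
            penalty="elasticnet", solver="saga",
            l1_ratio=0.5, C=0.1, max_iter=2000,
        ).fit(X[idx], y[idx])
        counts += np.abs(clf.coef_[0]) > 1e-8
    return np.where(counts / n_runs >= threshold)[0]

fh_neurons = stability_selection(X, y)
print(fh_neurons)  # a sparse, stable subset of the 200 candidate neurons
```

The stability-selection wrapper is what makes the selected set sparse and reproducible: a neuron counts as an FH-neuron only if the elastic net keeps it across most random subsamples, not just in one fit.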

Abstract

LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
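The amplify/suppress intervention can be illustrated on a toy MLP layer. This is a minimal sketch only: the weights, dimensions, and neuron indices below are made up, whereas the paper intervenes on actual activations inside Qwen2.5-32B-Instruct.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for one transformer MLP layer: scale a few hypothetical
# FH-neuron activations before the down-projection, mimicking the
# amplify/suppress causal experiment.
d_model, d_ff = 16, 64
W_up = rng.normal(scale=0.1, size=(d_model, d_ff))
W_down = rng.normal(scale=0.1, size=(d_ff, d_model))
fh_neurons = [3, 17, 42]  # illustrative indices, not from the paper

def mlp(x, scale=1.0):
    h = np.tanh(x @ W_up)        # hidden activations (tanh for simplicity)
    h[:, fh_neurons] *= scale    # scale < 1 suppresses, scale > 1 amplifies
    return h @ W_down

x = rng.normal(size=(2, d_model))
baseline = mlp(x)
suppressed = mlp(x, scale=0.0)
amplified = mlp(x, scale=2.0)
print(np.abs(baseline - suppressed).max())  # effect size of the intervention
```

In a real model the same idea would be implemented as a forward hook that rescales the chosen hidden units at inference time, which is what makes the mitigation lightweight: no retraining or external supervision is needed.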