Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

arXiv cs.CL · April 23, 2026


Key Points

  • Researchers studied whether so-called “hallucination neurons” (H-neurons), which predict LLM hallucinations on general QA, also transfer across different knowledge domains.
  • They evaluated cross-domain transfer across six domains (general QA, legal, financial, science, moral reasoning, and code vulnerability) using five open-weight LLMs (3B–8B parameters).
  • H-neuron-based classifiers showed strong in-domain performance (AUROC 0.783) but substantially weaker out-of-domain transfer (AUROC 0.563), a degradation consistent across all five models tested.
  • The findings suggest hallucination is not governed by a single universal neural signature; instead, it appears to involve domain-specific neuron populations.
  • As a practical implication, neuron-level hallucination detectors would need domain-specific calibration rather than one-size-fits-all training.

Abstract

Recent work identifies a sparse set of "hallucination neurons" (H-neurons), less than 0.1% of feed-forward network neurons, that reliably predict when large language models will hallucinate. These neurons are identified on general-knowledge question answering and shown to generalize to new evaluation instances. We ask a natural follow-up question: do H-neurons generalize across knowledge domains? Using a systematic cross-domain transfer protocol across 6 domains (general QA, legal, financial, science, moral reasoning, and code vulnerability) and 5 open-weight models (3B to 8B parameters), we find they do not. Classifiers trained on one domain's H-neurons achieve AUROC 0.783 within-domain but only 0.563 when transferred to a different domain (delta = 0.220, p < 0.001), a degradation consistent across all models tested. Our results suggest that hallucination is not a single mechanism with a universal neural signature, but rather involves domain-specific neuron populations that differ depending on the knowledge type being queried. This finding has direct implications for the deployment of neuron-level hallucination detectors, which must be calibrated per domain rather than trained once and applied universally.
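The transfer protocol can be illustrated with a toy sketch (this is not the paper's code; the data, neuron indices, and scoring rule are invented for illustration). The idea: a detector keyed on one domain's predictive neurons separates hallucinations well in that domain, but scores near chance on a domain where a different neuron population carries the signal. AUROC is computed here with the standard rank-based formula.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: probability a random positive outranks a random negative."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n = 200
# Toy assumption: in domain A the hallucination label tracks neuron 0,
# while in domain B it tracks neuron 1 (a disjoint neuron population).
y_a = rng.integers(0, 2, n)
y_b = rng.integers(0, 2, n)
X_a = rng.normal(size=(n, 2)); X_a[:, 0] += 2.0 * y_a
X_b = rng.normal(size=(n, 2)); X_b[:, 1] += 2.0 * y_b

# "Train" on domain A: the detector simply scores by neuron 0's activation.
score = lambda X: X[:, 0]
in_domain = auroc(score(X_a), y_a)     # strong separation
cross_domain = auroc(score(X_b), y_b)  # near chance
print(f"in-domain AUROC={in_domain:.3f}, cross-domain AUROC={cross_domain:.3f}")
```

The in-domain/out-of-domain gap in this toy setup mirrors the qualitative pattern the paper reports (0.783 vs. 0.563), and is why the authors argue detectors need per-domain calibration.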