Why Fine-Tuning Encourages Hallucinations and How to Fix It

arXiv cs.AI / 4/20/2026


Key Points

  • The paper argues that hallucinations in large language models can increase after supervised fine-tuning (SFT) because new factual learning may degrade or alter knowledge acquired during pre-training.
  • It proposes mitigation strategies based on continual-learning ideas, including a self-distillation-based SFT approach that reduces hallucinations by regularizing output-distribution drift.
  • It also shows that when acquiring new knowledge isn’t needed, freezing selected parameter groups to suppress “factual plasticity” can preserve task performance while lowering hallucinations.
  • The authors investigate why SFT causes hallucinations by testing several hypotheses (capacity limits, behavior cloning effects, and localized interference) and conclude that interference among overlapping semantic representations is a primary driver.
  • Experiments indicate that self-distillation works largely by mitigating this interference, enabling more effective factual learning with less degradation of prior knowledge.
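The self-distillation idea above can be sketched as a fine-tuning loss with two terms: standard cross-entropy on the new data, plus a KL penalty that keeps the fine-tuned model's output distribution close to the frozen pre-trained model's. This is a minimal illustration of the general technique, not the authors' exact objective; the function names and the `kl_weight` trade-off parameter are assumptions.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def self_distillation_loss(student_logits, teacher_logits, target_ids, kl_weight=1.0):
    """Cross-entropy on the fine-tuning targets (new factual learning)
    plus KL(teacher || student), which penalizes drift away from the
    frozen pre-trained model's output distribution."""
    logp_s = log_softmax(student_logits)
    rows = np.arange(len(target_ids))
    ce = -logp_s[rows, target_ids].mean()          # learn the new data
    logp_t = log_softmax(teacher_logits)
    kl = (np.exp(logp_t) * (logp_t - logp_s)).sum(axis=-1).mean()  # regularize drift
    return ce + kl_weight * kl
```

When the teacher and student distributions coincide, the KL term vanishes and the loss reduces to plain SFT cross-entropy; `kl_weight` controls how strongly retention of pre-existing knowledge is favored over new learning.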

Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations with respect to knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations with respect to pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
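The parameter-freezing strategy described in the abstract can be sketched as selecting named parameter groups and excluding them from gradient updates. The abstract does not specify which groups the authors freeze, so the name patterns below (`mlp.`, `embed`) are purely illustrative, and parameters are represented as plain dicts to keep the sketch framework-agnostic.

```python
def freeze_groups(named_params, patterns=("mlp.", "embed")):
    """Mark parameter tensors whose names match any pattern as frozen
    (excluded from gradient updates), suppressing 'factual plasticity'
    while leaving the remaining parameters trainable.

    `named_params` maps parameter names to state dicts with a
    'requires_grad' flag, mimicking a framework's named_parameters().
    Returns the list of frozen parameter names.
    """
    frozen = []
    for name, param in named_params.items():
        if any(p in name for p in patterns):
            param["requires_grad"] = False
            frozen.append(name)
    return frozen
```

In a real fine-tuning setup the same loop would set `param.requires_grad = False` on the chosen tensors before building the optimizer, so only the unfrozen groups receive updates.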
