Why Fine-Tuning Encourages Hallucinations and How to Fix It

arXiv cs.AI / 4/20/2026


Key Points

  • The paper argues that hallucinations in large language models can increase after supervised fine-tuning (SFT) because new factual learning may degrade or alter knowledge acquired during pre-training.
  • It proposes mitigation strategies based on continual-learning ideas, including a self-distillation-based SFT approach that reduces hallucinations by regularizing output-distribution drift.
  • It also shows that when acquiring new knowledge isn’t needed, freezing selected parameter groups to suppress “factual plasticity” can preserve task performance while lowering hallucinations.
  • The authors investigate why SFT causes hallucinations by testing several hypotheses (capacity limits, behavior cloning effects, and localized interference) and conclude that interference among overlapping semantic representations is a primary driver.
  • Experiments indicate that self-distillation works largely by mitigating this interference, enabling more effective factual learning with less degradation of prior knowledge.
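The self-distillation idea above can be sketched as a fine-tuning loss with two terms: standard cross-entropy on the new data, plus a KL penalty that keeps the fine-tuned model's output distribution close to the frozen pre-trained model's. This is a minimal illustration of the general technique, not the authors' exact objective; the function names and the `kl_weight` trade-off parameter are assumptions.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def self_distillation_loss(student_logits, teacher_logits, target_ids, kl_weight=1.0):
    """Cross-entropy on the fine-tuning targets (new factual learning)
    plus KL(teacher || student), which penalizes drift away from the
    frozen pre-trained model's output distribution."""
    logp_s = log_softmax(student_logits)
    rows = np.arange(len(target_ids))
    ce = -logp_s[rows, target_ids].mean()          # learn the new data
    logp_t = log_softmax(teacher_logits)
    kl = (np.exp(logp_t) * (logp_t - logp_s)).sum(axis=-1).mean()  # regularize drift
    return ce + kl_weight * kl
```

When the teacher and student distributions coincide, the KL term vanishes and the loss reduces to plain SFT cross-entropy; `kl_weight` controls how strongly retention of pre-existing knowledge is favored over new learning.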

Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations with respect to knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations with respect to pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
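The parameter-freezing strategy described in the abstract can be sketched as selecting named parameter groups and excluding them from gradient updates. The abstract does not specify which groups the authors freeze, so the name patterns below (`mlp.`, `embed`) are purely illustrative, and parameters are represented as plain dicts to keep the sketch framework-agnostic.

```python
def freeze_groups(named_params, patterns=("mlp.", "embed")):
    """Mark parameter tensors whose names match any pattern as frozen
    (excluded from gradient updates), suppressing 'factual plasticity'
    while leaving the remaining parameters trainable.

    `named_params` maps parameter names to state dicts with a
    'requires_grad' flag, mimicking a framework's named_parameters().
    Returns the list of frozen parameter names.
    """
    frozen = []
    for name, param in named_params.items():
        if any(p in name for p in patterns):
            param["requires_grad"] = False
            frozen.append(name)
    return frozen
```

In a real fine-tuning setup the same loop would set `param.requires_grad = False` on the chosen tensors before building the optimizer, so only the unfrozen groups receive updates.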
