From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

arXiv cs.LG / 5/5/2026


Key Points

  • The paper argues that existing hallucination-detection methods struggle with “stubborn hallucinations,” where LLMs remain confidently wrong.
  • It proposes a geometric approach called Embedding-Perturbed Gradient Sensitivity (EPGS) that distinguishes stable factual knowledge from brittle memorization.
  • EPGS works by adding Gaussian noise to input embeddings and measuring the resulting increase in gradient magnitude, using this spike as a proxy for the Hessian spectrum.
  • Experiments on hallucination detection show EPGS significantly outperforms entropy-based and representation-based baselines, improving detection of high-confidence factual errors.

Abstract

Traditional hallucination detection fails on "Stubborn Hallucinations": errors where LLMs are confidently wrong. We propose a geometric solution: Embedding-Perturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient proxy for the Hessian spectrum, differentiating stable knowledge from unstable memorization. Our experiments show that EPGS significantly outperforms entropy-based and representation-based baselines, providing a robust signal for detecting high-confidence factual errors.
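The flat-vs-sharp intuition behind EPGS can be sketched without an actual LLM. The toy below (not the paper's implementation; all names and constants are illustrative) replaces the model's loss landscape with two 1-D surfaces, one flat and one sharp around their minimum, then estimates sensitivity by averaging the gradient magnitude under Gaussian perturbations of the input, mirroring how EPGS perturbs input embeddings:

```python
import random

def epgs_score(grad_fn, x, sigma=0.1, n_samples=200, seed=0):
    """Average gradient magnitude under Gaussian perturbation of x.

    A high score means small input perturbations produce large gradients,
    i.e. the point sits in a sharp minimum (brittle memorization in the
    paper's framing); a low score indicates a flat minimum (robust fact).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, sigma)          # Gaussian noise on the input
        total += abs(grad_fn(x + eps))       # gradient magnitude at the perturbed point
    return total / n_samples

# Toy loss surfaces, both minimized at x = 0:
#   flat:  L(x) = 0.01 * x^2   ->  L'(x) = 0.02 * x
#   sharp: L(x) = 10.0 * x^2   ->  L'(x) = 20.0 * x
flat_grad = lambda x: 0.02 * x
sharp_grad = lambda x: 20.0 * x

flat_score = epgs_score(flat_grad, 0.0)
sharp_score = epgs_score(sharp_grad, 0.0)
print(flat_score, sharp_score)  # the sharp minimum scores far higher
```

In a real model the scalar `x` would be a token's embedding vector and `grad_fn` a backward pass through the network, but the thresholding logic, flagging outputs whose score spikes under perturbation, is the same.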
