Before the First Token: Scale-Dependent Emergence of Hallucination Signals in Autoregressive Language Models

arXiv cs.CL · April 16, 2026


Key Points

  • The paper investigates when hallucination-indicative internal representations emerge in autoregressive language models by analyzing probe detectability across 7 transformer sizes (117M–7B) and three fact-based datasets (TriviaQA, Simple Facts, Biography).
  • It reports a scale-dependent phase transition: models under ~400M parameters show chance-level factuality probe performance at all generation positions, while models above ~1B exhibit a qualitatively different regime with peak detectability at position zero (before any tokens are generated).
  • Cross-architecture evidence suggests the pre-generation hallucination/factuality signal is statistically significant in both Pythia-1.4B and Qwen2.5-7B, indicating the effect is not tied to a single model family or training corpus.
  • At the 7B scale, instruction tuning, not just pretraining, appears to matter: the base Pythia-6.9B model shows a flat temporal profile, while instruction-tuned Qwen2.5-7B shows a dominant pre-generation effect, implying that knowledge organization through post-training shapes these "knowledge circuits."
  • The study finds activation steering along probe-derived directions does not fix hallucinations, supporting the conclusion that the measured signal is correlational (useful for detection) rather than causal (useful for direct correction).
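The per-position probe protocol the key points describe can be sketched on synthetic data. This is an illustrative sketch only, not the paper's code: the shapes, the signal strength, the gradient-descent probe trainer, and the AUC routine are all assumptions chosen to mimic the reported pattern (a signal at position zero, chance elsewhere).

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(scores, labels):
    """Rank-based AUC: probability a positive example outranks a negative one."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_p, n_n = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_p * (n_p + 1) / 2) / (n_p * n_n)

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe on hidden states, trained by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

# Synthetic "hidden states" at 4 generation positions for 200 answers;
# only position 0 is given a factuality signal, mimicking the reported
# pre-generation regime of the larger models.
n, d, n_positions = 200, 32, 4
labels = rng.integers(0, 2, n)             # 1 = factual, 0 = hallucinated
hidden = rng.normal(size=(n_positions, n, d))
direction = rng.normal(size=d)
hidden[0] += 0.8 * np.outer(labels, direction)

train, heldout = slice(0, 150), slice(150, None)
aucs = []
for t in range(n_positions):
    w, b = fit_linear_probe(hidden[t, train], labels[train])
    aucs.append(auc(hidden[t, heldout] @ w + b, labels[heldout]))
    print(f"position {t}: held-out probe AUC = {aucs[-1]:.2f}")
```

Under this construction the position-0 probe separates the classes well on held-out examples while later positions stay near chance, which is the shape of the temporal profile the paper reports for the >1B models.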

Abstract

When do large language models decide to hallucinate? Despite serious consequences in healthcare, law, and finance, few formal answers exist. Recent work shows autoregressive models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood. We study the temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M–7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). We identify a scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy at every generation position (AUC = 0.48–0.67), indicating no reliable factuality signal. Above ~1B parameters, a qualitatively different regime emerges in which peak detectability occurs at position zero, before any tokens are generated, and then declines during generation. This pre-generation signal is statistically significant in both Pythia-1.4B (p = 0.012) and Qwen2.5-7B (p = 0.038), spanning distinct architectures and training corpora. At the 7B scale, we observe a striking dissociation: Pythia-6.9B (a base model trained on The Pile) produces a flat temporal profile (Δ = +0.001, p = 0.989), while instruction-tuned Qwen2.5-7B shows a dominant pre-generation effect. This indicates raw scale alone is insufficient: knowledge organization through instruction tuning or equivalent post-training is required for pre-commitment encoding. Activation steering along probe-derived directions fails to correct hallucinations across all models, confirming the signal is correlational rather than causal. Our findings provide scale-calibrated detection protocols and a concrete hypothesis about instruction tuning's role in developing knowledge circuits that support factual generation.
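The correlational-versus-causal distinction in the abstract has a simple geometric reading: steering a hidden state along the probe's weight direction moves the probe's own score by construction, whether or not anything causally tied to factual recall changes downstream. A minimal numpy sketch (the weight vector and hidden state here are random stand-ins, not values from any model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
w = rng.normal(size=d)            # probe weight vector (stand-in)
w_hat = w / np.linalg.norm(w)     # unit "factuality direction"

h = rng.normal(size=d)            # one hidden state (stand-in)
alpha = 3.0
h_steered = h + alpha * w_hat     # activation steering along the probe direction

# The probe's logit shifts by exactly alpha * ||w||, no matter what the
# steered state does to the model's actual generation.
shift = (h_steered @ w) - (h @ w)
print(f"probe logit shift: {shift:.3f}  (alpha * ||w|| = {alpha * np.linalg.norm(w):.3f})")
```

This is why a high-AUC probe does not imply a working intervention: the steering operation is guaranteed to satisfy the probe, so only downstream factual accuracy can tell a causal signal from a correlational one, and the paper reports that accuracy does not improve.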