The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

arXiv cs.CL / 4/28/2026

💬 Opinion · Models & Research

Key Points

  • The paper proposes Entropic Deviation (ED), a normalised KL-divergence metric comparing a language model's token distribution to the uniform distribution to quantify intrinsic non-randomness (a minimal sketch of the metric follows this list).
  • Across 31,200 generations over seven models, ED remains substantial (around 0.30 for transformers) even under semantically neutral prompts, suggesting that 88-93% of the non-randomness observed under semantic prompts is intrinsic to the learned weights rather than induced by context.
  • Transformer families such as Gemma, Llama, and Qwen converge on nearly identical ED values despite differences in training data and vocabularies, indicating that the non-randomness floor is a structural property of pretrained transformers.
  • In contrast, the state space model (Mamba2) exhibits a qualitatively different regime: roughly twice the ED, about three times lower within-sequence variance, and strong temperature sensitivity (r = -0.78), whereas transformers are nearly insensitive (r < 0.05).
  • Cross-lingual tests with Qwen-32B show a stable ED gradient across five languages that does not correlate with token fertility and persists even when two languages sharing an identical tokeniser subset are compared, implying that language itself modulates the randomness bound beyond tokenisation effects.
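
To make the metric concrete, below is a minimal sketch of how ED could be computed for a single next-token distribution. The KL direction (model distribution against uniform) and the normalisation by log|V| are illustrative assumptions; the paper's exact convention may differ.

```python
import numpy as np

def entropic_deviation(probs: np.ndarray) -> float:
    """Sketch of Entropic Deviation (ED) for one next-token distribution.

    Assumes ED = KL(p || uniform) / log|V|, which equals 1 - H(p)/log|V|
    and lies in [0, 1]: 0 for a uniform distribution, 1 for a one-hot one.
    The paper's exact normalisation may differ.
    """
    p = np.asarray(probs, dtype=np.float64)
    p = p / p.sum()                       # ensure a proper distribution
    vocab_size = p.size
    nonzero = p > 0                       # treat 0 * log 0 as 0
    kl = np.sum(p[nonzero] * np.log(p[nonzero] * vocab_size))
    return float(kl / np.log(vocab_size))

# Quick check: uniform gives ~0, a sharply peaked distribution approaches 1.
print(entropic_deviation(np.full(32_000, 1 / 32_000)))   # ~0.0
peaked = np.full(32_000, 1e-12)
peaked[0] = 1.0
print(entropic_deviation(peaked))                         # ~1.0
```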

Abstract

Language models cannot be random. This paper introduces Entropic Deviation (ED), the normalised KL divergence between a model's token distribution and the uniform distribution, and measures it systematically across 31,200 generations spanning seven models, two architectures (transformer and state space), nine prompt categories, three temperatures, and five languages. Under semantically neutral prompts (empty strings, random characters, nonsense syllables) transformers still exhibit ED of approximately 0.30, meaning that 88-93% of the non-randomness observed under semantic prompts is intrinsic to the learned weights rather than induced by context. Three transformer families (Gemma, Llama, Qwen) converge on nearly identical ED values despite different training data and vocabularies. A state space model (Mamba2) reveals a qualitatively different regime: twice the ED, three times lower within-sequence variance, and massive sensitivity to temperature (r = -0.78) where transformers are nearly immune (r < 0.05). Cross-lingual experiments with Qwen-32B show a stable gradient across five languages (English, Japanese, Chinese, Polish, Arabic) that does not correlate with token fertility and persists when two languages sharing an identical tokeniser subset are compared. These findings establish a structural lower bound on randomness in pretrained language models, characterise how this bound differs across architectures, and demonstrate that language itself modulates the bound independently of tokenisation.
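
For a rough sense of how such measurements could be reproduced in miniature, the sketch below scores the next-token distribution at every position of a given text with a Hugging Face causal LM and reports the mean and within-sequence variance of ED. The checkpoint name is a placeholder, and scoring a fixed text rather than sampled generations at several temperatures (as the paper does, across 31,200 generations) is a simplifying assumption; it reuses the entropic_deviation helper sketched above.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"   # placeholder checkpoint, not necessarily one used in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sequence_ed(text: str) -> tuple[float, float]:
    """Mean and within-sequence variance of per-position ED for a given text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                    # shape: (1, seq_len, vocab)
    probs = torch.softmax(logits[0], dim=-1).numpy()  # one distribution per position
    # entropic_deviation is the helper defined in the earlier sketch
    eds = [entropic_deviation(p) for p in probs]
    return float(np.mean(eds)), float(np.var(eds))

# Semantically neutral vs. semantic prompt, loosely mirroring the paper's prompt categories.
print(sequence_ed("qwzx vbnm plok"))
print(sequence_ed("The capital of France is"))
```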