Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability

arXiv cs.CL / 4/7/2026


Key Points

  • The paper tests whether transformer language models exhibit “scalar variability,” where representational noise scales proportionally with magnitude to yield a constant coefficient of variation seen in biological magnitude systems.
  • Across 26 numerical magnitudes in three 7–8B models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3-8B-Base), the authors find an anti-scalar pattern: representational variability decreases as magnitude increases (scaling exponent alpha ≈ -0.19).
  • The negative scaling persists under multiple checks, including full-dimensional-space analysis (alpha ≈ -0.04) and sentence-identity correction (alpha ≈ -0.007); none of the 16 primary layers shows alpha > 0 in any of the three models (0/16).
  • The anti-scalar effect is reported to be 3–5× stronger along the magnitude axis than in orthogonal dimensions, and corpus frequency substantially predicts per-magnitude variability (rho = 0.84).
  • The authors conclude that standard distributional learning in transformers reproduces some log-compressive magnitude geometry but does not produce the biological constant-CV noise signature.
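To make the "constant coefficient of variation" prediction concrete: CV = sigma / mu, so scalar variability means representational noise grows in proportion to the magnitude itself. A minimal sketch with hypothetical numbers (the Weber-fraction value w = 0.15 is illustrative, not from the paper):

```python
import numpy as np

# Scalar variability (the biological pattern) predicts that noise sigma
# scales proportionally with magnitude mu, so CV = sigma / mu is constant.
w = 0.15                                  # hypothetical Weber fraction
mus = np.array([2.0, 5.0, 10.0, 20.0])    # example magnitudes
sigmas = w * mus                          # noise proportional to magnitude
cvs = sigmas / mus

print(cvs)  # constant CV: 0.15 at every magnitude
```

The paper's finding is that transformer hidden states violate this: dispersion shrinks rather than grows with magnitude, so the CV is not constant.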

Abstract

Scalar variability -- the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation -- is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent alpha approx -0.19; 0/16 primary layers with alpha > 0, all three models). The negative sign was consistent in full-dimensional space (alpha approx -0.04) and after sentence-identity correction (alpha approx -0.007). The anti-scalar pattern was 3-5x stronger along the magnitude axis than orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability (rho = .84). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.
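The scaling exponent alpha reported above can be estimated by regressing log-dispersion on log-magnitude, since a power law sigma(m) ∝ m^alpha is linear in log-log space (alpha ≈ 1 would indicate scalar variability; alpha < 0 is the anti-scalar pattern). A minimal sketch on synthetic data generated with the paper's reported exponent, not on the paper's actual hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dispersions for 26 magnitudes following sigma(m) ∝ m^alpha with
# alpha = -0.19 (the reported exponent), plus small multiplicative noise.
magnitudes = np.arange(1, 27)
true_alpha = -0.19
dispersion = magnitudes.astype(float) ** true_alpha * np.exp(rng.normal(0, 0.02, 26))

# Fit log sigma = alpha * log m + c by ordinary least squares.
alpha, c = np.polyfit(np.log(magnitudes), np.log(dispersion), 1)

print(f"estimated alpha = {alpha:.3f}")  # negative slope => anti-scalar
```

In the paper's analyses this regression is applied per layer to the measured dispersion of hidden states across carrier sentences, along the magnitude axis and in orthogonal dimensions.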