Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

arXiv cs.LG / 4/23/2026


Key Points

  • The paper reports a systematic empirical study of transformer compression using 40+ experiments on GPT-2 (124M) and Mistral 7B (7.24B), evaluating methods such as spectral compression, block replacement, rotation-based quantization, activation-geometry analysis, and adaptive early exit.
  • Via CCA, it finds that high-variance activation directions are largely uncorrelated with predictive directions: projections onto high-variance subspaces retain over 90% of the variance yet still degrade perplexity, so preserving variance does not reliably preserve predictive quality.
  • The study shows that block linearity is conditional on the correct upstream activation distribution: changing earlier blocks causes distribution shift that worsens downstream linear approximations.
  • It identifies structural “compression walls,” including error amplification through cross-terms in reconstruction/factorization approaches, and documents a depth trend in which linearity increases substantially across layers (Mistral blocks rise from R^2 = 0.17 at block 0 to R^2 = 0.93 at block 31).
  • For compute reduction, the authors observe that roughly 30% of tokens are computationally easy. They also demonstrate a strong single-block result: linearly replacing Mistral’s final block yields 34× compression with only a 1.71-point perplexity increase, while multi-block replacement underperforms due to residual error accumulation and distribution shift.
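
The "variance is not importance" finding rests on comparing two subspaces via CCA. A minimal sketch of that comparison, using entirely synthetic data (the paper's actual activations come from GPT-2 and Mistral 7B, and its predictive directions are derived from the model, not chosen at random as here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 64, 2000, 8  # hidden dim, samples, subspace rank

# Synthetic activations with steeply decaying per-axis variance, so a few
# directions dominate the total variance (as in real transformer activations).
scales = np.geomspace(10.0, 0.001, d)
acts = rng.normal(size=(n, d)) * scales

# A hypothetical "predictive" subspace, drawn at random and therefore
# unrelated to the high-variance axes by construction.
predictive_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0]  # (d, k) orthonormal

# Top-k principal (high-variance) directions of the activations.
centered = acts - acts.mean(0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
variance_dirs = vt[:k].T  # (d, k) orthonormal

# Fraction of total variance retained by the top-k directions.
var_frac = (s[:k] ** 2).sum() / (s ** 2).sum()

# CCA between two orthonormal bases reduces to the singular values of
# U1^T U2 (the cosines of the principal angles between the subspaces).
cca_corrs = np.linalg.svd(variance_dirs.T @ predictive_dirs, compute_uv=False)

print(f"variance retained by top-{k} directions: {var_frac:.2f}")
print(f"mean CCA correlation with predictive subspace: {cca_corrs.mean():.2f}")
```

With this construction the top-k subspace captures most of the variance yet has near-zero canonical correlation with the "predictive" subspace, which is the shape of the paper's result: variance retention and predictive relevance can come apart completely.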

Abstract

We present a systematic empirical study of transformer compression through over 40 experiments on GPT-2 (124M parameters) and Mistral 7B (7.24B parameters). Our analysis covers spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit. We identify five structural properties relevant to compression. (1) Variance is not importance: high-variance activation directions are approximately 96 percent uncorrelated with predictive directions (measured via CCA), and projecting onto these subspaces preserves over 90 percent of variance while degrading perplexity. (2) Block linearity is conditional: transformer blocks are approximately linear (R^2 ~ 0.95 on GPT-2, 0.93 on Mistral block 31) only under the correct upstream distribution; modifying earlier blocks induces distribution shift that degrades downstream approximations. (3) The reconstruction wall: approaches that factor weights into quantized components amplify errors through cross-terms, making direct quantization strictly superior. (4) Linearity increases with depth: Mistral 7B exhibits a progression from R^2 = 0.17 (block 0) to R^2 = 0.93 (block 31), indicating a division between nonlinear feature construction and linear refinement. (5) Approximately 30 percent of tokens are computationally easy, confirmed via exit heads and KL divergence sensitivity. We demonstrate that single-block linear replacement achieves 34x compression with a 1.71 perplexity increase on the final block of Mistral 7B, while multi-block replacement fails due to residual error accumulation and distribution shift. These findings suggest fundamental limits to static post-training compression and motivate adaptive, per-token computation as a more effective direction.
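
The single-block replacement result comes from fitting a linear surrogate to a block's input/output behavior and scoring it with R^2. A toy sketch of that probe, where a near-linear synthetic function stands in for a late Mistral block (names and shapes are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 32, 5000

# Fixed linear map; the toy block below is this map plus a small
# nonlinearity, mimicking the paper's near-linear late blocks.
W_true = rng.normal(size=(d, d)) / np.sqrt(d)

def toy_block(x):
    return x @ W_true + 0.05 * np.tanh(x)

x = rng.normal(size=(n, d))   # stand-in for the block's upstream activations
y = toy_block(x)

# Closed-form least-squares fit of a replacement linear map W: x @ W ~ y.
W_fit, *_ = np.linalg.lstsq(x, y, rcond=None)
y_hat = x @ W_fit

# R^2 of the linear surrogate against the block's true outputs.
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean(0)) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 of linear replacement: {r2:.3f}")
```

Note that the fit is only valid under the input distribution it was trained on; the paper's "block linearity is conditional" finding is exactly that such surrogates degrade once upstream edits shift the activation distribution away from the one used for fitting.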