Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
arXiv cs.LG / 4/23/2026
Key Points
- The paper reports a systematic empirical study of transformer compression using 40+ experiments on GPT-2 (124M) and Mistral 7B (7.24B), evaluating methods such as spectral compression, block replacement, rotation-based quantization, activation-geometry analysis, and adaptive early exit.
- Via canonical correlation analysis (CCA), it finds that high-variance activation directions are largely uncorrelated with the directions that drive predictions: perplexity degrades even when compression retains over 90% of activation variance, so preserving variance in those subspaces does not reliably preserve predictive quality (a toy version of this comparison appears first after this list).
- The study shows that block linearity is conditional on the correct upstream activation distribution: changing earlier blocks causes distribution shift that worsens downstream linear approximations.
- It identifies structural “compression walls,” including error amplification in reconstruction/factorization approaches (cross-terms between stacked approximation errors) and a depth-dependent shift in which blocks become substantially more linear with depth (e.g., Mistral blocks rising from R^2=0.17 to R^2=0.93; the second sketch below shows a block-level R^2 probe of this kind).
- For compute reduction, the authors observe that about 30% of tokens are computationally easy, and they report a strong single-block result: replacing Mistral’s final block with a linear map yields 34× compression at only a 1.71-point perplexity increase, while multi-block replacement underperforms due to residual-stream error accumulation and distribution shift (linear replacement and early exit are sketched below).
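
To make the variance-vs-importance comparison concrete, here is a minimal sketch of a CCA-style measurement between a high-variance activation subspace and a “predictive” subspace, on synthetic data. Everything in it is an illustrative assumption: the arrays `acts` and `grads`, the rank `k`, and the choice of loss gradients as the predictive signal are stand-ins, not the paper’s setup.

```python
# Sketch: canonical correlations between the top-k PCA directions of the
# activations ("high variance") and the top-k directions of a predictive
# signal (here, per-token loss gradients w.r.t. the activations).
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2048, 256, 16             # tokens, hidden size, subspace rank

acts = rng.normal(size=(n, d))      # stand-in for cached block activations
grads = rng.normal(size=(n, d))     # stand-in for per-token loss gradients

def top_k_basis(X, k):
    """Orthonormal basis for the top-k right singular directions of X."""
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T                 # (d, k), columns orthonormal

U = top_k_basis(acts, k)            # high-variance subspace
V = top_k_basis(grads, k)           # "predictive" subspace

# For orthonormal bases, the singular values of U^T V are the canonical
# correlations (cosines of the principal angles between the subspaces).
canon_corrs = np.linalg.svd(U.T @ V, compute_uv=False)
print("canonical correlations:", np.round(canon_corrs, 3))
# Values near 0 mean the subspaces are nearly orthogonal: preserving variance
# says little about preserving the directions the loss actually depends on.
```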
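
The block-linearity numbers and the single-block replacement result rest on the same primitive: fit an affine map from a block’s inputs to its outputs and measure how much output variance it explains. Below is a minimal sketch on synthetic data; the variables `X`, `Y`, and the noise level are assumptions, not the authors’ activations or code.

```python
# Sketch: block-level linearity probe. Assumes cached (input, output)
# activation pairs for one transformer block; here they are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 256                     # tokens, hidden size

X = rng.normal(size=(n, d))          # stand-in for block inputs
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
Y = X @ W_true + 0.3 * rng.normal(size=(n, d))  # stand-in for block outputs

# Fit an affine map Y ~ X W + b by least squares.
X_aug = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
W, b = coef[:-1], coef[-1]

# R^2 of the fit: fraction of output variance the affine map explains.
resid = Y - (X @ W + b)
r2 = 1.0 - resid.var() / Y.var()
print(f"block-level R^2 of linear fit: {r2:.3f}")
```

When R^2 is high (the paper reports up to 0.93 for late Mistral blocks), the whole attention-plus-MLP block can be swapped for the pair (W, b), i.e., d*d + d parameters, which is where large per-block compression ratios like the reported 34× come from. The conditional-linearity caveat above also falls out of this setup: W was fit on one input distribution, so editing upstream blocks shifts X and degrades the fit.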
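
Finally, a hedged sketch of confidence-based adaptive early exit, one standard way to exploit the observation that a sizable fraction of tokens are easy. The exit rule (max softmax probability at an intermediate layer crossing a threshold), the `ease` variable, and all constants are illustrative assumptions, not the paper’s mechanism.

```python
# Sketch: per-token early exit. A real implementation would apply the LM head
# to each layer's hidden state; here intermediate_logits() fakes that signal.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, vocab = 12, 500, 1000
threshold = 0.9                              # assumed confidence threshold
ease = rng.uniform(0.5, 2.0, size=n_tokens)  # per-token difficulty proxy

def intermediate_logits(layer, token):
    """Stand-in for applying the LM head to this layer's hidden state."""
    logits = rng.normal(size=vocab)
    logits[0] += layer * ease[token]         # easy tokens sharpen faster
    return logits

exit_layers = []
for t in range(n_tokens):
    for layer in range(n_layers):
        logits = intermediate_logits(layer, t)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Exit as soon as the head is confident, or at the final layer.
        if probs.max() >= threshold or layer == n_layers - 1:
            exit_layers.append(layer + 1)
            break

exit_layers = np.array(exit_layers)
print(f"mean layers used:     {exit_layers.mean():.1f} of {n_layers}")
print(f"tokens exiting early: {(exit_layers < n_layers).mean():.0%}")
```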