Anatomical Heterogeneity in Transformer Language Models

arXiv cs.LG / March 23, 2026


Key Points

  • The paper analyzes SmolLM2-135M (30 layers, 135M parameters) using five diagnostic metrics and reveals pronounced anatomical heterogeneity across transformer layers, challenging the assumption of uniform computational budgets.
  • Layer weights show strong mathematical regularity (R² ≈ 0.91) with a universal oscillatory delta pattern, yet substituting the predicted weights into the model causes catastrophic failure through nonlinear error accumulation.
  • Layer importance spans a 10^7 range, from a critical core (L8-11) to anti-layers (L14, L17) whose removal can actually improve performance, revealing a clear hierarchy of layer importance.
  • Recovery speed correlates with layer importance, indicating that layers have differential training requirements; among five tested manipulation strategies, only weight scaling (α = 0.9) preserves model quality.
  • Growth Transformer Training allocates training budget by layer importance and achieves roughly a 54% cost reduction; a proof-of-concept run reaches 4.7x lower validation loss than uniform training at identical parameter count while running 13% faster.
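The ablation diagnostic behind the "anti-layer" finding can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the perplexity numbers are made-up toy values standing in for a real held-out evaluation, chosen only to show how degradation percentages and anti-layers would be computed.

```python
# Hedged sketch of a per-layer ablation diagnostic.
# Toy PPL values below are illustrative assumptions, not the paper's data.

def ablation_degradation(baseline_ppl: float, ablated_ppls: dict) -> dict:
    """Percent perplexity degradation from removing each layer individually."""
    return {
        layer: 100.0 * (ppl - baseline_ppl) / baseline_ppl
        for layer, ppl in ablated_ppls.items()
    }

# Stand-in evaluation results: removing a core layer blows perplexity up,
# while removing an "anti-layer" slightly lowers it.
baseline = 20.0
ablated = {"L9": 12700.0, "L14": 19.4, "L17": 19.7}

deg = ablation_degradation(baseline, ablated)
# Negative degradation means removal *improves* the model: an anti-layer.
anti_layers = [layer for layer, d in deg.items() if d < 0]
```

With these toy numbers, removing `L9` degrades perplexity by 63,400%, while `L14` and `L17` come out as anti-layers.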

Abstract

Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R²), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R² = 0.91) with a universal oscillatory delta pattern (correlation ≈ −0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (α = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof-of-concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.
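The budget-allocation idea in point (5) can be illustrated with a minimal sketch. The abstract does not specify the allocation rule, so the proportional scheme, importance scores, and step counts below are assumptions chosen purely for illustration.

```python
# Hedged sketch: distribute a fixed training budget across layers in
# proportion to measured importance, in the spirit of the paper's
# Growth Transformer Training. The proportional rule and all numbers
# here are illustrative assumptions, not the paper's algorithm.

def allocate_budget(importance: dict, total_steps: int) -> dict:
    """Split total_steps across layers proportionally to importance scores."""
    total = sum(importance.values())
    return {
        layer: round(total_steps * score / total)
        for layer, score in importance.items()
    }

# Toy importance scores: critical-core layers get heavy weight,
# anti-layers get almost none.
importance = {"L8": 8.0, "L9": 8.0, "L14": 0.5, "L17": 0.5}
steps = allocate_budget(importance, total_steps=1700)
```

Under this toy allocation, each core layer receives 800 of the 1,700 steps and each anti-layer only 50, which is how skewing budget toward important layers could cut overall cost at matched quality.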
