Anatomical Heterogeneity in Transformer Language Models

arXiv cs.LG / March 23, 2026


Key Points

  • The paper analyzes SmolLM2-135M (30 layers, 135M parameters) using five diagnostic metrics and reveals pronounced anatomical heterogeneity across transformer layers, challenging the assumption of uniform computational budgets.
  • Layer weights show strong mathematical regularity (R² ≈ 0.91) with a universal oscillatory delta pattern, yet substituting the predicted weights into the model causes catastrophic failure through nonlinear error accumulation.
  • Layer importance spans a 10^7 range, from a critical core (L8-11) to anti-layers (L14, L17) whose removal can actually improve performance, revealing a clear hierarchy of layer importance.
  • Recovery speed correlates with layer importance, indicating that layers have differential training requirements; among five tested manipulation strategies, only weight scaling (α = 0.9) preserves model quality.
  • Growth Transformer Training allocates training budget by layer importance and achieves roughly a 54% cost reduction; a proof-of-concept run reaches 4.7x lower validation loss than uniform training at identical parameter count while running 13% faster.
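The ablation diagnostic behind the "anti-layer" finding can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the perplexity numbers are made-up toy values standing in for a real held-out evaluation, chosen only to show how degradation percentages and anti-layers would be computed.

```python
# Hedged sketch of a per-layer ablation diagnostic.
# Toy PPL values below are illustrative assumptions, not the paper's data.

def ablation_degradation(baseline_ppl: float, ablated_ppls: dict) -> dict:
    """Percent perplexity degradation from removing each layer individually."""
    return {
        layer: 100.0 * (ppl - baseline_ppl) / baseline_ppl
        for layer, ppl in ablated_ppls.items()
    }

# Stand-in evaluation results: removing a core layer blows perplexity up,
# while removing an "anti-layer" slightly lowers it.
baseline = 20.0
ablated = {"L9": 12700.0, "L14": 19.4, "L17": 19.7}

deg = ablation_degradation(baseline, ablated)
# Negative degradation means removal *improves* the model: an anti-layer.
anti_layers = [layer for layer, d in deg.items() if d < 0]
```

With these toy numbers, removing `L9` degrades perplexity by 63,400%, while `L14` and `L17` come out as anti-layers.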

Abstract

Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R²), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R² = 0.91) with a universal oscillatory delta pattern (correlation ≈ −0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (α = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof-of-concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.
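The budget-allocation idea in point (5) can be illustrated with a minimal sketch. The abstract does not specify the allocation rule, so the proportional scheme, importance scores, and step counts below are assumptions chosen purely for illustration.

```python
# Hedged sketch: distribute a fixed training budget across layers in
# proportion to measured importance, in the spirit of the paper's
# Growth Transformer Training. The proportional rule and all numbers
# here are illustrative assumptions, not the paper's algorithm.

def allocate_budget(importance: dict, total_steps: int) -> dict:
    """Split total_steps across layers proportionally to importance scores."""
    total = sum(importance.values())
    return {
        layer: round(total_steps * score / total)
        for layer, score in importance.items()
    }

# Toy importance scores: critical-core layers get heavy weight,
# anti-layers get almost none.
importance = {"L8": 8.0, "L9": 8.0, "L14": 0.5, "L17": 0.5}
steps = allocate_budget(importance, total_steps=1700)
```

Under this toy allocation, each core layer receives 800 of the 1,700 steps and each anti-layer only 50, which is how skewing budget toward important layers could cut overall cost at matched quality.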
