Depth-first pruning seems to transfer from GPT-2 to Llama (unexpectedly well)

Reddit r/artificial / 2026/3/31

Key points

  • The article reports that “depth-first pruning” (removing the right subset of transformer layers) reduces parameters and improves inference speed with only modest quality loss, outperforming uniform width shrinking in the authors’ tests.
  • Experiments show consistent results when moving from GPT-2 (124M) to TinyLlama 1.1B, with layer counts pruned down by a few layers and performance degradation staying relatively small.
  • The authors find that early/middle layers are more often safe to drop while first and last layers are usually critical, and that the best layer subset can change after pruning and distillation recovery.
  • Recovery via distillation appears stable (including across multiple random seeds in one setup), suggesting the method is robust once the pruning configuration is fixed.
  • A key takeaway is that the same pruning-and-recovery recipe can transfer across architectures (not only GPT-2), motivating further comparison of depth pruning versus width reduction.

TL;DR:
Removing the right transformer layers (instead of shrinking all layers) gives smaller, faster models with minimal quality loss — and this seems to transfer from GPT-2 to Llama.

Been experimenting with a simple idea: instead of shrinking model width, just remove entire layers based on sensitivity and then recover with distillation.
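The post doesn't share code, but the core loop is easy to sketch. Below is a minimal toy version of sensitivity-based depth pruning (names and the proxy "model" are made up for illustration): score each layer by how much the loss rises when that single layer is skipped, then drop the least sensitive ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a stack of 12 near-identity linear layers (stand-ins for
# transformer blocks) applied to a random input batch.
layers = [rng.normal(0, 0.1, (16, 16)) + np.eye(16) for _ in range(12)]
x = rng.normal(size=(8, 16))
target = rng.normal(size=(8, 16))

def forward(x, layers, skip=frozenset()):
    h = x
    for i, w in enumerate(layers):
        if i not in skip:
            h = np.tanh(h @ w)
    return h

def loss(layers, skip=frozenset()):
    return float(np.mean((forward(x, layers, skip) - target) ** 2))

base = loss(layers)
# Sensitivity of layer i = loss increase when only that layer is skipped.
sensitivity = {i: loss(layers, {i}) - base for i in range(len(layers))}

# Drop the k layers whose removal hurts least, always keeping the first
# and last layers (which the post found are almost always critical).
k = 2
candidates = [i for i in sensitivity if i not in (0, len(layers) - 1)]
to_drop = sorted(candidates, key=lambda i: sensitivity[i])[:k]
pruned = [w for i, w in enumerate(layers) if i not in to_drop]
print(f"dropped layers {sorted(to_drop)}, kept {len(pruned)}/12")
```

In the real setup you'd measure sensitivity as perplexity on a held-out set and follow the pruning step with a distillation pass to recover quality.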

Originally tested it on GPT-2 (124M) and it worked pretty well. Decided to try the exact same approach on TinyLlama 1.1B to see if it was just a fluke.

But it wasn’t.

GPT-2 (12L → 10L / 9L)

  • ~11–17% parameter reduction
  • ~9–13% PPL degradation
  • ~1.2x decode speedup

TinyLlama 1.1B (22L → 20L / 19L)

  • 20L: ~8% smaller, PPL ratio ~1.058
  • 19L: ~12% smaller, PPL ratio ~1.081
  • 20L gives a clean speedup, 19L is more mixed
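The same arithmetic works for TinyLlama, assuming its published config (hidden = 2048, intermediate = 5632, 32 query heads / 4 KV heads with head_dim = 64, 22 layers):

```python
h, inter, kv_dim = 2048, 5632, 4 * 64       # 4 KV heads * head_dim 64 (GQA)
attn = 2 * h * h + 2 * h * kv_dim           # q,o full-size; k,v grouped
mlp = 3 * h * inter                         # gate, up, down projections
per_layer = attn + mlp                      # ≈ 44M params per block
total = 1_100_000_000

for dropped in (2, 3):                      # 22L -> 20L / 19L
    print(f"drop {dropped} layers: ~{dropped * per_layer / total:.1%} smaller")
# drop 2 layers: ~8.0% smaller
# drop 3 layers: ~12.0% smaller
```

matching the ~8% / ~12% figures above.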

Also ran 3 seeds on the 20L setup:
9.72 / 9.72 / 9.70 PPL → basically no variance

A couple things that stood out:

  • early/mid layers are consistently easier to drop
  • first/last layers are almost always critical
  • the “best” layer pair changes after pruning + recovery (model rebalances)
  • once the setup is fixed, recovery is surprisingly stable
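For the recovery step, the usual move (and presumably what "distillation" means here, though the post doesn't show its loss) is to train the pruned student on the original model's soft targets. A minimal numpy sketch of that temperature-scaled soft-target loss:

```python
import numpy as np

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation loss: cross-entropy of the student against
    the teacher's temperature-softened distribution (equals KL up to a
    constant in the student). T**2 keeps gradient scale comparable across T."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits / T)                 # teacher soft targets
    log_q = np.log(softmax(student_logits / T))     # student log-probs
    return float(-(p * log_q).sum(axis=-1).mean() * T * T)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
loss_match = kd_loss(teacher.copy(), teacher)               # student = teacher
loss_off = kd_loss(teacher + rng.normal(size=(4, 10)), teacher)
assert loss_match < loss_off   # loss is minimized when logits match the teacher
```

Minimizing this over the training data is what lets the remaining layers "rebalance" after a layer pair is removed.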

Takeaway (for me at least):

Removing the right layers seems to preserve structure much better than shrinking everything uniformly.

And more interestingly, the same basic recipe works across architectures — not just GPT-2.

Not claiming anything groundbreaking here, just surprised how cleanly it transferred.

Curious if others have seen similar behavior with depth pruning vs width reduction.

submitted by /u/califalcon