TL;DR:
Removing the right transformer layers (instead of shrinking all layers) gives smaller, faster models with minimal quality loss — and this seems to transfer from GPT-2 to Llama.
I've been experimenting with a simple idea: instead of shrinking model width, just remove entire layers based on sensitivity and then recover with distillation.
Originally tested it on GPT-2 (124M) and it worked pretty well. Decided to try the exact same approach on TinyLlama 1.1B to see if it was just a fluke.
It wasn't.
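For anyone curious what the sensitivity scan looks like, here's a minimal sketch of the idea: ablate one block at a time, score it by how much the loss rises, and drop the least-sensitive ones. All names here are my own (a toy residual MLP stands in for the transformer); for a real model you'd run the same scan over something like `model.transformer.h` and then do a distillation pass to recover.

```python
# Toy sketch of sensitivity-based depth pruning (names/architecture are
# illustrative, not the exact setup from the post).
import torch
import torch.nn as nn

torch.manual_seed(0)

class ResidualBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
    def forward(self, x):
        return x + self.ff(x)  # residual form makes "skip this block" well-defined

class ToyModel(nn.Module):
    def __init__(self, d=16, n_layers=6):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBlock(d) for _ in range(n_layers))
        self.head = nn.Linear(d, d)
    def forward(self, x, skip=None):
        for i, blk in enumerate(self.blocks):
            if i != skip:  # ablating block i = treating it as identity
                x = blk(x)
        return self.head(x)

@torch.no_grad()
def layer_sensitivity(model, x, y, loss_fn):
    base = loss_fn(model(x), y).item()
    # sensitivity[i] = loss increase when block i is skipped
    return [loss_fn(model(x, skip=i), y).item() - base
            for i in range(len(model.blocks))]

model = ToyModel()
x, y = torch.randn(64, 16), torch.randn(64, 16)
sens = layer_sensitivity(model, x, y, nn.MSELoss())
drop = sorted(range(len(sens)), key=sens.__getitem__)[:2]  # 2 least-sensitive blocks
model.blocks = nn.ModuleList(
    b for i, b in enumerate(model.blocks) if i not in drop
)
print(len(model.blocks))  # 4 blocks remain
```

This is the one-at-a-time version; scanning pairs jointly (as the "best layer pair" observation below suggests) is the same loop over index pairs.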
GPT-2 (12L → 10L / 9L)
- ~11–17% parameter reduction
- ~9–13% PPL degradation
- ~1.2x decode speedup
TinyLlama 1.1B (22L → 20L / 19L)
- 20L: ~8% smaller, PPL ratio ~1.058
- 19L: ~12% smaller, PPL ratio ~1.081
- 20L gives a clean speedup; 19L is more mixed
Also ran 3 seeds on the 20L setup:
9.72 / 9.72 / 9.70 PPL → basically no variance
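Quick sanity arithmetic on those numbers (the unpruned baseline PPL isn't stated above, so I back it out from the reported ratio; treat it as inferred, not measured):

```python
# Back-of-envelope check on the 3-seed 20L run.
seed_ppls = [9.72, 9.72, 9.70]
mean_ppl = sum(seed_ppls) / len(seed_ppls)  # ~9.713
spread = max(seed_ppls) - min(seed_ppls)    # 0.02 PPL across seeds
implied_base = 9.72 / 1.058                 # ~9.19 implied for the unpruned 22L model
```

A 0.02 spread on ~9.7 PPL is about 0.2%, which is why I'm calling it "basically no variance".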
A couple things that stood out:
- early/mid layers are consistently easier to drop
- first/last layers are almost always critical
- the “best” layer pair changes after pruning + recovery (model rebalances)
- once the setup is fixed, recovery is surprisingly stable
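The recovery step mentioned above is just standard logit distillation against the original (unpruned) model as teacher. A minimal sketch, assuming the usual temperature-softened KL objective (function and variable names are mine):

```python
# Hypothetical recovery objective: KL between teacher and pruned-student
# logits on temperature-softened distributions, scaled by T^2 as in
# standard knowledge distillation.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

# Stand-in tensors; in practice these come from the pruned model and the
# frozen original model on the same batch.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distill_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

In my runs this plain objective was enough; the "surprisingly stable" recovery didn't need anything fancier like intermediate-layer matching.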
Takeaway (for me at least):
Removing the right layers seems to preserve structure much better than shrinking everything uniformly.
And more interestingly, the same basic recipe works across architectures — not just GPT-2.
Not claiming anything groundbreaking here, just surprised how cleanly it transferred.
Curious if others have seen similar behavior with depth pruning vs width reduction.
