TL;DR:
Removing the right transformer layers (instead of shrinking all layers) gives smaller, faster models with minimal quality loss — and this seems to transfer from GPT-2 to Llama.
I've been experimenting with a simple idea: instead of shrinking model width, just remove entire layers based on sensitivity and then recover with distillation.
Originally tested it on GPT-2 (124M) and it worked pretty well. Decided to try the exact same approach on TinyLlama 1.1B to see if it was just a fluke.
It wasn't.
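For anyone curious what the sensitivity scan looks like, here's a minimal sketch of the idea: ablate one block at a time, score it by how much the loss rises, and drop the least-sensitive ones. All names here are my own (a toy residual MLP stands in for the transformer); for a real model you'd run the same scan over something like `model.transformer.h` and then do a distillation pass to recover.

```python
# Toy sketch of sensitivity-based depth pruning (names/architecture are
# illustrative, not the exact setup from the post).
import torch
import torch.nn as nn

torch.manual_seed(0)

class ResidualBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
    def forward(self, x):
        return x + self.ff(x)  # residual form makes "skip this block" well-defined

class ToyModel(nn.Module):
    def __init__(self, d=16, n_layers=6):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBlock(d) for _ in range(n_layers))
        self.head = nn.Linear(d, d)
    def forward(self, x, skip=None):
        for i, blk in enumerate(self.blocks):
            if i != skip:  # ablating block i = treating it as identity
                x = blk(x)
        return self.head(x)

@torch.no_grad()
def layer_sensitivity(model, x, y, loss_fn):
    base = loss_fn(model(x), y).item()
    # sensitivity[i] = loss increase when block i is skipped
    return [loss_fn(model(x, skip=i), y).item() - base
            for i in range(len(model.blocks))]

model = ToyModel()
x, y = torch.randn(64, 16), torch.randn(64, 16)
sens = layer_sensitivity(model, x, y, nn.MSELoss())
drop = sorted(range(len(sens)), key=sens.__getitem__)[:2]  # 2 least-sensitive blocks
model.blocks = nn.ModuleList(
    b for i, b in enumerate(model.blocks) if i not in drop
)
print(len(model.blocks))  # 4 blocks remain
```

This is the one-at-a-time version; scanning pairs jointly (as the "best layer pair" observation below suggests) is the same loop over index pairs.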
GPT-2 (12L → 10L / 9L)
- ~11–17% parameter reduction
- ~9–13% PPL degradation
- ~1.2x decode speedup
TinyLlama 1.1B (22L → 20L / 19L)
- 20L: ~8% smaller, PPL ratio ~1.058
- 19L: ~12% smaller, PPL ratio ~1.081
- 20L gives a clean speedup; 19L is more mixed
Also ran 3 seeds on the 20L setup:
9.72 / 9.72 / 9.70 PPL → basically no variance
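Quick sanity arithmetic on those numbers (the unpruned baseline PPL isn't stated above, so I back it out from the reported ratio; treat it as inferred, not measured):

```python
# Back-of-envelope check on the 3-seed 20L run.
seed_ppls = [9.72, 9.72, 9.70]
mean_ppl = sum(seed_ppls) / len(seed_ppls)  # ~9.713
spread = max(seed_ppls) - min(seed_ppls)    # 0.02 PPL across seeds
implied_base = 9.72 / 1.058                 # ~9.19 implied for the unpruned 22L model
```

A 0.02 spread on ~9.7 PPL is about 0.2%, which is why I'm calling it "basically no variance".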
A couple things that stood out:
- early/mid layers are consistently easier to drop
- first/last layers are almost always critical
- the “best” layer pair changes after pruning + recovery (model rebalances)
- once the setup is fixed, recovery is surprisingly stable
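The recovery step mentioned above is just standard logit distillation against the original (unpruned) model as teacher. A minimal sketch, assuming the usual temperature-softened KL objective (function and variable names are mine):

```python
# Hypothetical recovery objective: KL between teacher and pruned-student
# logits on temperature-softened distributions, scaled by T^2 as in
# standard knowledge distillation.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

# Stand-in tensors; in practice these come from the pruned model and the
# frozen original model on the same batch.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distill_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

In my runs this plain objective was enough; the "surprisingly stable" recovery didn't need anything fancier like intermediate-layer matching.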
Takeaway (for me at least):
Removing the right layers seems to preserve structure much better than shrinking everything uniformly.
And more interestingly, the same basic recipe works across architectures — not just GPT-2.
Not claiming anything groundbreaking here, just surprised how cleanly it transferred.
Curious if others have seen similar behavior with depth pruning vs width reduction.
