
We compressed 6 LLMs and found something surprising: they don't degrade the same way

Reddit r/LocalLLaMA / 3/17/2026


Key Points

  • The study compresses MLP layers inside transformer models (no quantization or custom kernels) and evaluates accuracy drops across ARC, HellaSwag, MMLU, and TruthfulQA.
  • Some models are far more compressible than others, e.g., Gemma 2B preserves about 92% accuracy at 14% compression while Llama 3.1 8B drops to around 85% at the same level.
  • The original perplexity improvements did not translate downstream on the benchmarks, challenging prior assumptions about what compression gains mean for downstream tasks.
  • The results reveal a model-specific efficiency frontier: all models degrade smoothly but at very different rates, with tasks like reasoning dropping faster than language-only tasks and RAG/chat tolerating more compression.
  • They provide standard dense HF checkpoints compatible with vLLM/TGI/llama.cpp, require no custom kernels, and can be stacked with quantization; next steps include automatic per-model compression point finding and expanding to more architectures.

TL;DR: we shrank the MLP layers inside transformers (no quantization, no custom kernels) and measured how accuracy drops across ARC, HellaSwag, MMLU, and TruthfulQA.
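(For anyone who wants a concrete picture of what "shrinking MLP layers" can look like: one common runtime-agnostic approach is low-rank factorization of the weight matrices. This numpy toy is just an illustrative sketch with made-up sizes, not necessarily our exact method.)

```python
import numpy as np

def low_rank_compress(W, keep):
    """Replace W (d_out x d_in) with factors A @ B, keeping a fraction of singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = max(1, int(keep * len(S)))
    A = U[:, :r] * S[:r]   # (d_out, r), singular values folded into A
    B = Vt[:r, :]          # (r, d_in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096))  # toy stand-in for one MLP projection
A, B = low_rank_compress(W, keep=0.25)

orig, new = W.size, A.size + B.size
print(f"params: {orig:,} -> {new:,} ({new / orig:.0%})")  # two dense factors replace one matrix
```

Note the output is still plain dense matrices (one linear layer becomes two), which is why nothing downstream needs custom kernels.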

We expected similar behavior across models.

We were wrong.

Even more surprising, the original perplexity (PPL) improvements did not translate downstream on the benchmarks.

The key result

Some models are way more compressible than others.

  • Gemma 2B → holds ~92% accuracy at 14% compression
  • Llama 3.1 8B → drops to ~85% at the same compression

Same method. Same % removed. Totally different outcomes.

The efficiency frontier

(chart below)

Each line is a model compressed from 0 → ~40% MLP reduction.

Takeaway:
All models degrade smoothly — but at very different rates.

What stood out

  • Gemma compresses best (flat curve early)
  • Llama degrades fastest (especially larger models)
  • MMLU drops first (reasoning breaks early)
  • TruthfulQA barely moves (language stays intact)

What this means

There isn’t a single “right” compression level.

There’s a model-specific efficiency frontier.

Example:

  • RAG / chat → can tolerate more compression
  • reasoning agents → break quickly

Why this is useful

We output standard dense HF checkpoints:

  • works with vLLM / TGI / llama.cpp
  • no custom kernels
  • stacks with quantization

So you can take one of these smaller dense models and then quantize it too!
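(To show why the two compose: quantization acts on whatever dense weights you hand it, compressed or not. Here's a toy symmetric per-channel int8 scheme in numpy — purely illustrative; in practice you'd reach for bitsandbytes, GPTQ, AWQ, etc.)

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-output-channel int8 quantization (toy version)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.round(W / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W_dense = rng.standard_normal((1024, 1024)).astype(np.float32)  # stand-in for an already-compressed weight
q, scale = quantize_int8(W_dense)

W_deq = q.astype(np.float32) * scale          # dequantize to check fidelity
max_err = float(np.abs(W_deq - W_dense).max())
print(f"max abs error: {max_err:.4f}")
```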

What we're exploring next

  • automatically finding the best compression point per model
  • expanding to more architectures
  • understanding why some models compress better
  • improved quality with even deeper compression, still runtime agnostic

Looking for people who find this interesting and have suggestions for models they want compressed like this. It takes me about 25 minutes per model, so I'm open to any and all suggestions, insights, etc.

Right now we are using PPL under 2.0x baseline to create the frontier, but we could easily optimize around a different SLO. I just need some insight from users as to what they're looking for.
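(In pseudocode terms, the frontier search is just: sweep compression levels and keep the highest one whose PPL stays under the SLO multiple. Numbers below are made up for illustration.)

```python
def max_compression_under_slo(points, baseline_ppl, slo_mult=2.0):
    """points: (compression_fraction, perplexity) pairs from a compression sweep."""
    ok = [c for c, ppl in points if ppl <= slo_mult * baseline_ppl]
    return max(ok, default=0.0)

# made-up sweep: PPL rises as more of the MLP is removed
curve = [(0.0, 6.1), (0.10, 6.8), (0.20, 8.9), (0.30, 13.5), (0.40, 21.0)]
print(max_compression_under_slo(curve, baseline_ppl=6.1))  # -> 0.2
```

Swapping in a different SLO is a one-argument change, e.g. `slo_mult=3.0` would admit deeper compression.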

Would be excited to work with anyone who thinks this is cool.

Models + code: https://huggingface.co/dystrio

Curious what others think — where would you actually run these tradeoffs?

https://preview.redd.it/5durtlal2lpg1.png?width=2379&format=png&auto=webp&s=d66e06b3961f280a0f4e00cdb3ceb2c171d13afb

https://preview.redd.it/j237iwzm2lpg1.png?width=2754&format=png&auto=webp&s=8e9a686a07ebbc41dd6bba2b006e69ec753d7dc9

submitted by /u/Quiet_Training_8167