
We compressed 6 LLMs and found something surprising: they don't degrade the same way

Reddit r/LocalLLaMA / 3/17/2026


Key Points

  • The study compresses MLP layers inside transformer models (no quantization or custom kernels) and evaluates accuracy drops across ARC, HellaSwag, MMLU, and TruthfulQA.
  • Some models are far more compressible than others, e.g., Gemma 2B preserves about 92% accuracy at 14% compression while Llama 3.1 8B drops to around 85% at the same level.
  • The original perplexity improvements did not translate downstream on the benchmarks, challenging prior assumptions about what compression gains mean for downstream tasks.
  • The results reveal a model-specific efficiency frontier: all models degrade smoothly but at very different rates, with tasks like reasoning dropping faster than language-only tasks and RAG/chat tolerating more compression.
  • They provide standard dense HF checkpoints compatible with vLLM/TGI/llama.cpp, require no custom kernels, and can be stacked with quantization; next steps include automatic per-model compression point finding and expanding to more architectures.

TL;DR: we shrank the MLP layers inside transformers (no quantization, no custom kernels) and measured how accuracy drops across ARC, HellaSwag, MMLU, and TruthfulQA.
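(For anyone who wants a concrete picture of what "shrinking MLP layers" can look like: one common runtime-agnostic approach is low-rank factorization of the weight matrices. This numpy toy is just an illustrative sketch with made-up sizes, not necessarily our exact method.)

```python
import numpy as np

def low_rank_compress(W, keep):
    """Replace W (d_out x d_in) with factors A @ B, keeping a fraction of singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = max(1, int(keep * len(S)))
    A = U[:, :r] * S[:r]   # (d_out, r), singular values folded into A
    B = Vt[:r, :]          # (r, d_in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096))  # toy stand-in for one MLP projection
A, B = low_rank_compress(W, keep=0.25)

orig, new = W.size, A.size + B.size
print(f"params: {orig:,} -> {new:,} ({new / orig:.0%})")  # two dense factors replace one matrix
```

Note the output is still plain dense matrices (one linear layer becomes two), which is why nothing downstream needs custom kernels.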

We expected similar behavior across models.

We were wrong.

Even more surprising, the original perplexity (PPL) improvements did not translate downstream on the benchmarks.

The key result

Some models are way more compressible than others.

  • Gemma 2B → holds ~92% accuracy at 14% compression
  • Llama 3.1 8B → drops to ~85% at the same compression

Same method. Same % removed. Totally different outcomes.

The efficiency frontier

(chart below)

Each line is a model compressed from 0 → ~40% MLP reduction.

Takeaway:
All models degrade smoothly — but at very different rates.

What stood out

  • Gemma compresses best (flat curve early)
  • Llama degrades fastest (especially larger models)
  • MMLU drops first (reasoning breaks early)
  • TruthfulQA barely moves (language stays intact)

What this means

There isn’t a single “right” compression level.

There’s a model-specific efficiency frontier.

Example:

  • RAG / chat → can tolerate more compression
  • reasoning agents → break quickly

Why this is useful

We output standard dense HF checkpoints:

  • works with vLLM / TGI / llama.cpp
  • no custom kernels
  • stacks with quantization

So you can take one of these smaller dense models and then quantize it too!
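(To show why the two compose: quantization acts on whatever dense weights you hand it, compressed or not. Here's a toy symmetric per-channel int8 scheme in numpy — purely illustrative; in practice you'd reach for bitsandbytes, GPTQ, AWQ, etc.)

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-output-channel int8 quantization (toy version)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.round(W / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W_dense = rng.standard_normal((1024, 1024)).astype(np.float32)  # stand-in for an already-compressed weight
q, scale = quantize_int8(W_dense)

W_deq = q.astype(np.float32) * scale          # dequantize to check fidelity
max_err = float(np.abs(W_deq - W_dense).max())
print(f"max abs error: {max_err:.4f}")
```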

What we're exploring next

  • automatically finding the best compression point per model
  • expanding to more architectures
  • understanding why some models compress better
  • improved quality with even deeper compression, still runtime agnostic

Looking for people who find this interesting and have suggestions for models they want compressed like this. It takes me about 25 minutes per model, so I'm open to any and all suggestions, insights, etc.

Right now we are using PPL under 2.0x baseline to create the frontier, but we could easily optimize around a different SLO. I just need some insight from users as to what they're looking for.
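(In pseudocode terms, the frontier search is just: sweep compression levels and keep the highest one whose PPL stays under the SLO multiple. Numbers below are made up for illustration.)

```python
def max_compression_under_slo(points, baseline_ppl, slo_mult=2.0):
    """points: (compression_fraction, perplexity) pairs from a compression sweep."""
    ok = [c for c, ppl in points if ppl <= slo_mult * baseline_ppl]
    return max(ok, default=0.0)

# made-up sweep: PPL rises as more of the MLP is removed
curve = [(0.0, 6.1), (0.10, 6.8), (0.20, 8.9), (0.30, 13.5), (0.40, 21.0)]
print(max_compression_under_slo(curve, baseline_ppl=6.1))  # -> 0.2
```

Swapping in a different SLO is a one-argument change, e.g. `slo_mult=3.0` would admit deeper compression.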

Would be excited to work with anyone who thinks this is cool.

Models + code: https://huggingface.co/dystrio

Curious what others think — where would you actually run these tradeoffs?

https://preview.redd.it/5durtlal2lpg1.png?width=2379&format=png&auto=webp&s=d66e06b3961f280a0f4e00cdb3ceb2c171d13afb

https://preview.redd.it/j237iwzm2lpg1.png?width=2754&format=png&auto=webp&s=8e9a686a07ebbc41dd6bba2b006e69ec753d7dc9

submitted by /u/Quiet_Training_8167