TL;DR: we shrink the MLP layers inside transformers (no quantization, no custom kernels) and measure how accuracy drops across ARC, HellaSwag, MMLU, and TruthfulQA. We expected similar behavior across models. We were wrong. Even more surprising, the original perplexity (PPL) improvements did not translate downstream on the benchmarks. The key result: some models are far more compressible than others.
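To make "shrinking the MLP layers" concrete, here is a minimal toy sketch of width pruning on a single MLP block. The importance score (combined L2 norm of each hidden unit's in/out weights) is an assumption for illustration, not the method from the post; real models would also need the activation function and surrounding layers handled.

```python
import numpy as np

def prune_mlp(w_in, w_out, keep_frac):
    """Shrink one MLP block by dropping low-importance hidden units.
    w_in: (d_model, d_hidden) up-projection; w_out: (d_hidden, d_model) down-projection.
    Importance = L2-norm heuristic (an illustrative assumption, not the post's method)."""
    d_hidden = w_in.shape[1]
    keep = max(1, int(round(keep_frac * d_hidden)))
    # Score each hidden unit by the product of its input- and output-weight norms.
    scores = np.linalg.norm(w_in, axis=0) * np.linalg.norm(w_out, axis=1)
    idx = np.sort(np.argsort(scores)[-keep:])  # keep the top-`keep` units, preserve order
    return w_in[:, idx], w_out[idx, :]

rng = np.random.default_rng(0)
d, h = 16, 64
w_in, w_out = rng.normal(size=(d, h)), rng.normal(size=(h, d))
w_in_s, w_out_s = prune_mlp(w_in, w_out, keep_frac=0.6)
print(w_in_s.shape, w_out_s.shape)  # hidden width shrinks from 64 to 38
```

The result is still a plain dense pair of matrices, just narrower, which is why the outputs stay standard checkpoints rather than needing sparse kernels.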
Same method. Same % removed. Totally different outcomes.

The efficiency frontier
(Chart in the original post: each line is a model compressed from 0 to ~40% MLP reduction.)
What this means
There isn't a single "right" compression level; there's a model-specific efficiency frontier. Example: Gemma 2B preserves about 92% accuracy at 14% compression, while Llama 3.1 8B drops to around 85% at the same level.
Why this is useful
We output standard dense HF checkpoints: they run in vLLM/TGI/llama.cpp with no custom kernels.
So you can take one of these smaller dense models and then quantize it too!

What we're exploring next
Automatically finding the compression point per model, and expanding to more architectures.
Looking for people who find this interesting and have suggestions for models they want compressed like this. It takes me about 25 minutes per model, so I'm open to any and all suggestions, insights, etc. Right now we use PPL under 2.0x baseline to build the frontier, but we could easily optimize around a different SLO; I just need some insight from users as to what they are looking for. Would be excited to work with anyone who thinks this is cool. Models + code: https://huggingface.co/dystrio. Curious what others think: where would you actually run these tradeoffs?
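The "PPL under 2.0x baseline" rule above can be sketched as a frontier-point selection: sweep compression levels, and keep the largest one whose measured perplexity stays under the SLO. The function name and the example numbers are hypothetical; only the 2.0x-baseline criterion comes from the post.

```python
def frontier_point(curve, baseline_ppl, slo=2.0):
    """curve: list of (compression_frac, ppl) measured points.
    Return the largest compression whose PPL stays within slo * baseline,
    or 0.0 if no compressed point qualifies. Hypothetical helper mirroring
    the post's 'PPL under 2.0x baseline' rule."""
    ok = [c for c, p in curve if p <= slo * baseline_ppl]
    return max(ok) if ok else 0.0

# Illustrative numbers only: baseline PPL 8.0, so the SLO cutoff is 16.0.
curve = [(0.0, 8.0), (0.1, 8.9), (0.2, 10.5), (0.3, 14.0), (0.4, 21.0)]
print(frontier_point(curve, baseline_ppl=8.0))  # 0.3 (14.0 <= 16.0, but 21.0 is over)
```

Swapping the SLO is then just a parameter change, e.g. a tighter `slo=1.2` for quality-sensitive deployments.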
We compressed 6 LLMs and found something surprising: they don't degrade the same way
Reddit r/LocalLLaMA / 3/17/2026
Key Points
- The study compresses MLP layers inside transformer models (no quantization or custom kernels) and evaluates accuracy drops across ARC, HellaSwag, MMLU, and TruthfulQA.
- Some models are far more compressible than others, e.g., Gemma 2B preserves about 92% accuracy at 14% compression while Llama 3.1 8B drops to around 85% at the same level.
- The original perplexity improvements did not translate downstream on the benchmarks, challenging prior assumptions about what compression gains mean for downstream tasks.
- The results reveal a model-specific efficiency frontier: all models degrade smoothly but at very different rates, with tasks like reasoning dropping faster than language-only tasks and RAG/chat tolerating more compression.
- They provide standard dense HF checkpoints compatible with vLLM/TGI/llama.cpp, require no custom kernels, and can be stacked with quantization; next steps include automatic per-model compression point finding and expanding to more architectures.