GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
arXiv cs.AI / 4/22/2026
Key Points
- GRASPrune is a post-pretraining structured pruning framework that jointly prunes FFN channels and KV head groups of large language models under a single global parameter/compute budget.
- It learns lightweight gate scores using a projected straight-through estimator to enforce a hard pruning mask at every step, while keeping the original backbone weights frozen.
- After selecting which units to keep, GRASPrune calibrates scaling factors to reduce scale mismatch from pruning and folds them into the remaining weights to produce a smaller dense checkpoint for inference without extra parameters.
- On LLaMA-2-7B, GRASPrune prunes 50% of parameters and reports WikiText-2 perplexity of 12.18 while maintaining competitive average zero-shot accuracy, using only calibration (no full fine-tuning) on a single NVIDIA A100 80GB GPU.
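The gate-then-fold pipeline in the points above can be sketched in a few lines. This is a minimal illustration, assuming NumPy: the function names, the saliency-based gate update, and the single output-projection matrix are all hypothetical stand-ins, not the paper's actual objective or implementation. It shows the three ingredients named above: projecting soft scores onto a hard top-k mask, passing gradients "straight through" the mask to the scores, and folding calibrated per-channel scales into the kept weights to yield a smaller dense matrix.

```python
import numpy as np

def hard_topk_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Project soft gate scores onto a hard 0/1 mask keeping the top-k units,
    so the budget (k kept units) holds exactly at every step."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

def ste_gate_step(scores: np.ndarray, unit_saliency: np.ndarray,
                  k: int, lr: float = 0.1):
    """One straight-through update: the forward pass uses the hard mask, but
    the backward pass treats the mask as identity, so gradients on the mask
    flow unchanged into the soft scores. The loss here (maximize masked
    saliency) is a hypothetical placeholder for the real training objective."""
    mask = hard_topk_mask(scores, k)
    grad_wrt_mask = -unit_saliency      # d(loss)/d(mask)
    scores = scores - lr * grad_wrt_mask  # straight-through: same grad on scores
    return scores, mask

def fold_scale_and_prune(W_out: np.ndarray, mask: np.ndarray,
                         scale: np.ndarray) -> np.ndarray:
    """Drop pruned input channels of an output projection and fold the
    calibrated per-channel scales into the kept columns, producing a smaller
    dense weight matrix with no extra inference-time parameters."""
    keep = mask.astype(bool)
    return W_out[:, keep] * scale[keep]
```

A quick usage example: with scores `[0.1, 0.9, 0.5, 0.2]` and a budget of `k=2`, the hard mask keeps units 1 and 2; folding scales `[2.0, 0.5]` into a 2x4 projection then yields a dense 2x2 matrix that needs no mask at inference.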