SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression
arXiv cs.CL / 4/7/2026
Key Points
- The paper introduces SoLA, a training-free LLM compression approach that uses soft activation sparsity to keep only the most inference-relevant components dense while compressing the rest via low-rank decomposition (a generic sketch of this split follows the list).
- SoLA is designed around analysis of activation patterns in the feed-forward network (FFN) of modern LLMs, enabling component selection without requiring special hardware or costly post-training.
- To reduce losses from low-rank truncation, the method applies an adaptive, component-wise low-rank allocation strategy that chooses a truncation position per weight matrix (see the second sketch below).
- Experiments on LLaMA-2 (7B/13B/70B) and Mistral-7B show accuracy gains without any post-training; at 30% compression of LLaMA-2-70B, SoLA reportedly lowers perplexity from 6.95 (prior state of the art) to 4.44 and improves downstream accuracy by 10%.
- The results suggest SoLA can make deploying large LLMs more affordable and practical by shrinking parameter footprints while preserving quality.
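The summary does not reproduce SoLA's exact selection criterion, but the keep-dense / compress-low-rank split can be illustrated with a minimal sketch: score FFN neurons by their mean activation magnitude on a small calibration set, keep the highest-scoring rows intact, and replace the rest with a truncated-SVD factorization. All names here (`split_and_compress`, `calib_acts`, `keep_frac`) are hypothetical stand-ins, not the paper's API.

```python
import numpy as np

def split_and_compress(W, calib_acts, keep_frac=0.1, rank=64):
    """Hypothetical SoLA-style split of one FFN projection.

    W          -- (d_out, d_in) weight matrix (rows = neurons).
    calib_acts -- (n_samples, d_out) neuron activations on calibration data.
    keep_frac  -- fraction of neurons kept dense (the "inference-relevant" part).
    rank       -- truncation rank for the compressed remainder.
    """
    # Soft-sparsity proxy: score each neuron by its mean absolute activation.
    scores = np.abs(calib_acts).mean(axis=0)
    order = np.argsort(scores)
    k = max(1, int(keep_frac * W.shape[0]))
    keep_idx, rest_idx = order[-k:], order[:-k]

    W_keep = W[keep_idx]  # most active rows, stored exactly
    # Low-activity rows are replaced by a rank-r factorization A @ B.
    U, S, Vt = np.linalg.svd(W[rest_idx], full_matrices=False)
    r = min(rank, len(S))
    A, B = U[:, :r] * S[:r], Vt[:r]
    return keep_idx, W_keep, rest_idx, A, B

def forward(x, keep_idx, W_keep, rest_idx, A, B, d_out):
    """Recombine y ~= W @ x from the dense block and the low-rank factors."""
    y = np.empty(d_out)
    y[keep_idx] = W_keep @ x
    y[rest_idx] = A @ (B @ x)  # two thin matmuls instead of one dense one
    return y
```

Storing the low-activity block as two thin factors is where the parameter savings come from; the dense block preserves the neurons that matter most at inference.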
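How SoLA picks each truncation position is likewise not detailed in this summary. A common adaptive rule, used below purely as a stand-in, selects the smallest rank per matrix that retains a target fraction of the squared singular-value (Frobenius) energy, so each weight matrix gets its own truncation point.

```python
import numpy as np

def adaptive_rank(W, energy=0.95):
    """Smallest rank r such that the top-r singular values of W carry
    at least `energy` of the total squared singular-value mass.
    A stand-in for the paper's component-wise allocation rule."""
    s = np.linalg.svd(W, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# Usage: each FFN projection gets its own truncation position.
rng = np.random.default_rng(0)
ffn = {"gate": rng.normal(size=(2048, 512)),
       "up":   rng.normal(size=(2048, 512)),
       "down": rng.normal(size=(512, 2048))}
ranks = {name: adaptive_rank(W, energy=0.90) for name, W in ffn.items()}
print(ranks)
```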