SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

arXiv cs.CL / 4/7/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces “SoLA,” a training-free LLM compression approach that uses soft activation sparsity to keep only the most inference-relevant components while compressing the rest via low-rank decomposition.
SoLA is designed around analysis of activation patterns in the feed-forward network (FFN) of modern LLMs, enabling component selection without requiring special hardware or costly post-training.
To reduce losses from low-rank truncation, the method applies an adaptive component-wise low-rank allocation strategy that chooses truncation positions per weight matrix.
Experiments across LLaMA-2 (7B/13B/70B) and Mistral-7B show accuracy improvements without post-training, including a reported 30% compression on LLaMA-2-70B that improves perplexity from 6.95 to 4.44 and boosts downstream accuracy by 10% versus prior state-of-the-art.
The results suggest SoLA can make deploying large LLMs more affordable and practical by shrinking parameter footprints while preserving quality.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but the billion-scale parameters pose deployment challenges. Although existing methods attempt to reduce the scale of LLMs, they require either special hardware support or expensive post-training to maintain model quality. To facilitate efficient and affordable model slimming, we propose a novel training-free compression method for LLMs, named "SoLA", which leverages \textbf{So}ft activation sparsity and \textbf{L}ow-r\textbf{A}nk decomposition. SoLA can identify and retain a minority of components significantly contributing to inference, while compressing the majority through low-rank decomposition, based on our analysis of the activation pattern in the feed-forward network (FFN) of modern LLMs. To alleviate the decomposition loss, SoLA is equipped with an adaptive component-wise low-rank allocation strategy to assign appropriate truncation positions for different weight matrices. We conduct extensive experiments on LLaMA-2-7B/13B/70B and Mistral-7B models across a variety of benchmarks. SoLA exhibits remarkable improvement in both language modeling and downstream task accuracy without post-training. For example, with a 30\% compression rate on the LLaMA-2-70B model, SoLA surpasses the state-of-the-art method by reducing perplexity from 6.95 to 4.44 and enhancing downstream task accuracy by 10\%.

Black Hat Asia

AI Business

v0.20.5

Ollama Releases

Inside Anthropic's Project Glasswing: The AI Model That Found Zero-Days in Every Major OS

Dev.to

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database.

Reddit r/LocalLLaMA

SoloEngine: Low-Code Agentic AI Development Platform with Native Support for Multi-Agent Collaboration, MCP, and Skill System

Dev.to

SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

Key Points

Abstract

Related Articles

Black Hat Asia

v0.20.5

Inside Anthropic's Project Glasswing: The AI Model That Found Zero-Days in Every Major OS

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database.

SoloEngine: Low-Code Agentic AI Development Platform with Native Support for Multi-Agent Collaboration, MCP, and Skill System

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer