AI Navigate

ActTail: Global Activation Sparsity in Large Language Models

arXiv cs.LG / 3/16/2026


Key Points

  • ActTail introduces a TopK magnitude-based activation sparsity method with global allocation for large language models, aiming to reduce compute and memory movement during inference.
  • It explicitly accounts for heterogeneity in transformer weights by computing a heavy-tail exponent from each projection's empirical spectral density to allocate projection-specific sparsity budgets.
  • The paper provides a theoretical relationship between activation sparsity ratio and the heavy-tail exponent under the HT-SR regime to guide sparsity decisions beyond heuristic rules.
  • Experimental results on LLaMA and Mistral show improved perplexity and downstream task performance at high sparsity, with 80% sparsity reducing perplexity by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B relative to uniform allocation.
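The core mechanic the key points describe — keep only the largest-magnitude activations per projection, with a per-projection sparsity budget instead of one uniform ratio — can be sketched as follows. The `topk_sparsify` function and the `budgets` dictionary (including the specific projection names and ratios) are illustrative assumptions, not the paper's actual implementation or allocation:

```python
import numpy as np

def topk_sparsify(x, sparsity):
    """Zero all but the top-k activations by magnitude.

    x: 1-D activation vector feeding one projection.
    sparsity: fraction of entries to zero (0.8 keeps the largest 20%).
    """
    k = max(1, int(round(len(x) * (1.0 - sparsity))))
    # Indices of the k largest-magnitude entries.
    keep = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[keep] = x[keep]
    return out

# Hypothetical projection-specific budgets: in the HT-SR view, more
# heavy-tailed projections are better trained and tolerate higher
# sparsity, so their budgets could be set higher. Values are made up.
budgets = {"q_proj": 0.85, "k_proj": 0.85, "v_proj": 0.70, "up_proj": 0.80}

x = np.random.randn(4096)
y = topk_sparsify(x, budgets["up_proj"])
```

The zeroed entries let the matching weight rows be skipped at inference time, which is where the compute and memory-movement savings come from.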

Abstract

Activation sparsity is a promising approach for accelerating large language model (LLM) inference by reducing computation and memory movement. However, existing activation sparsity methods typically apply uniform sparsity across projections, ignoring the heterogeneous statistical properties of Transformer weights and thereby amplifying performance degradation. In this paper, we propose ActTail, a TopK magnitude-based activation sparsity method with global activation sparsity allocation grounded in Heavy-Tailed Self-Regularization (HT-SR) theory. Specifically, we capture this heterogeneity via the heavy-tail exponent computed from each projection's empirical spectral density (ESD), which is used as a quantitative indicator to assign projection-specific sparsity budgets. Importantly, we provide a theoretical analysis that establishes an explicit relationship between the activation sparsity ratio and the heavy-tail exponent under the HT-SR regime, offering principled guidance for sparsity allocation beyond heuristic design. Experiments on LLaMA and Mistral models show that our method improves both perplexity and downstream task performance at high sparsity compared to uniform allocation. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B.
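The heavy-tail exponent the abstract refers to is fit from each projection's empirical spectral density, i.e. the eigenvalues of W^T W. As an illustrative stand-in (the paper and HT-SR tooling fit a power law more carefully; the Hill-style estimator and `tail_frac` below are assumptions for the sketch):

```python
import numpy as np

def heavy_tail_exponent(W, tail_frac=0.1):
    """Rough heavy-tail exponent alpha of a weight matrix's ESD.

    The ESD is the set of eigenvalues of W^T W (squared singular values).
    A Hill-style estimator on the upper tail gives a quick alpha; smaller
    alpha means a heavier tail.
    """
    eigs = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)
    k = max(2, int(len(eigs) * tail_frac))  # size of the upper tail
    tail = eigs[-k:]
    xmin = tail[0]
    # Hill estimator: alpha = 1 + k / sum(log(lambda_i / xmin))
    return 1.0 + k / np.sum(np.log(tail / xmin))

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
alpha = heavy_tail_exponent(W)
```

Under the HT-SR regime, such an alpha per projection would then feed the paper's theoretical sparsity-vs-exponent relationship to set each projection's budget; the exact mapping is given in the paper and is not reproduced here.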