Activation Compression in LLMs: Theoretical Analysis and Efficient Algorithm

arXiv cs.LG / 5/5/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper studies activation compression for LLM training, arguing it is theoretically less mature than gradient/optimizer-state compression and presenting new LLM-relevant theory and guarantees.
  • It shows that activation compression is safe for linear operators when the compression is unbiased, but can be problematic for nonlinear operators (a short worked illustration follows this list).
  • The authors derive gradient variance bounds and prove convergence guarantees for applying compression to all linear operators under the standard L-smoothness assumption, showing that the convergence rate is unchanged.
  • Guided by the theory, they propose “activation-gradient co-compression,” which reuses low-rank activation factors to compress gradients in linear-layer backpropagation without extra computation or added gradient error (a minimal sketch follows the abstract).
  • Experiments on Qwen and LLaMA models across a pretraining benchmark and multiple fine-tuning benchmarks show competitive accuracy and improved compression efficiency; the code is shared for reproducibility.
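
As a rough illustration of why unbiased compression is harmless for linear operators but can bias nonlinear ones, consider a single linear layer. The notation below is mine, not taken from the paper, and the upstream gradient is treated as fixed.

```latex
% Illustration only; notation is mine, not the paper's.
% Linear layer y = W x with upstream gradient g = \nabla_y \mathcal{L} treated as given.
\nabla_W \mathcal{L} = g\, x^{\top},
\qquad
\mathbb{E}\!\left[\, g\, \hat{x}^{\top} \right]
  = g\, \mathbb{E}[\hat{x}]^{\top}
  = g\, x^{\top}
\quad \text{if } \mathbb{E}[\hat{x}] = x .
% For a nonlinear operator y = \sigma(x), backprop needs \sigma'(x), and in general
% \mathbb{E}[\sigma'(\hat{x})] \neq \sigma'(x), so the same substitution can bias the gradient.
```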

Abstract

Training large language models (LLMs) is highly memory-intensive, as training must store not only weights and optimizer states but also intermediate activations for backpropagation. While existing memory-efficient methods largely focus on gradients and optimizer states, activation compression is less well established due to the lack of LLM-tailored theory and guarantees. In this work, we develop a theoretical framework showing that activation compression is safe for linear operators when the compression is unbiased, but problematic for nonlinear ones. We further derive a gradient variance bound and establish convergence guarantees for applying activation compression to all linear operators under the standard L-smoothness assumption, showing that it does not change the convergence rate. Guided by the theory, we propose an activation-gradient co-compression method that reuses low-rank activation factors to compress linear-layer gradients without extra computation or additional gradient error. We conduct extensive experiments on Qwen and LLaMA models using a pretraining benchmark and multiple fine-tuning benchmarks to validate our theory and demonstrate competitive performance of our method in both accuracy and compression efficiency. We provide our code in the supplementary material for reproducibility.
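
The NumPy sketch below shows one way the co-compression idea could look for a single linear layer: the activation is stored as an unbiased rank-r factorization, and the same factors are reused so the weight gradient comes out already factored, with no extra projection step. The Gaussian random-projection compressor, shapes, and function names are my own assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: the Gaussian random-projection compressor and toy
# shapes are assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def compress_activation(x, r):
    """Unbiased low-rank compression of a saved activation matrix x ([n, d]).

    With s having i.i.d. N(0, 1) entries of shape [d, r], E[s @ s.T] = r * I_d,
    so scaling each factor by 1/sqrt(r) gives E[p @ q] = x. Only the two small
    factors are stored for the backward pass.
    """
    n, d = x.shape
    s = rng.standard_normal((d, r))
    p = x @ s / np.sqrt(r)        # [n, r]  stored instead of x
    q = s.T / np.sqrt(r)          # [r, d]  shared projection factor
    return p, q

def weight_grad_from_factors(p, q, grad_out):
    """Weight gradient of y = x @ w computed from the compressed activation.

    Exact rule: dW = x.T @ grad_out. Substituting x ≈ p @ q gives
    dW ≈ q.T @ (p.T @ grad_out): a rank-r factorization obtained by reusing the
    activation factors, with no extra projection and no error beyond the one
    already introduced by compressing the activation.
    """
    left = q.T                    # [d_in, r]
    right = p.T @ grad_out        # [r, d_out]
    return left, right            # dW ≈ left @ right

# Toy usage for one linear layer.
n, d_in, d_out, r = 32, 64, 16, 8
x = rng.standard_normal((n, d_in))
w = rng.standard_normal((d_in, d_out))
y = x @ w                                    # forward pass
p, q = compress_activation(x, r)             # keep only the factors
grad_out = rng.standard_normal((n, d_out))   # stand-in for the upstream gradient
left, right = weight_grad_from_factors(p, q, grad_out)
print("approx dW shape:", (left @ right).shape, "exact dW shape:", (x.T @ grad_out).shape)
```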
