ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

arXiv cs.CL / 5/1/2026


Key Points

  • Distributed LLM training is often limited by communication overhead, and the paper argues that lossless compression has been underused because compression and decompression can cost more than the communication savings.
  • The authors observe that training communications (activations, gradients, parameters) are often close to a Gaussian distribution, enabling efficient lossless compression (see the first sketch after this list).
  • They introduce ZipCCL, a lossless compressed communication library for LLM collectives, including exponent coding tailored to near-Gaussian tensors and GPU-optimized compression/decompression kernels.
  • ZipCCL also uses adaptive collective switching, choosing among collective operations at runtime based on workload and system characteristics (see the second sketch below).
  • On a 64-GPU cluster across mixture-of-experts and dense transformer models, ZipCCL cuts communication time by up to 1.35× and improves end-to-end training speed by up to 1.18× without affecting model quality.
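
To make the near-Gaussian observation concrete, here is a minimal, self-contained sketch (not ZipCCL code): the FP32 exponent bits of Gaussian-distributed values concentrate on a handful of values, so their empirical entropy sits far below the 8 raw exponent bits, which is what makes exponent coding effective.

```python
import numpy as np

# Empirically check exponent concentration for near-Gaussian tensor values.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1e-3, size=1_000_000).astype(np.float32)

bits = x.view(np.uint32)
exponents = (bits >> 23) & 0xFF  # the 8-bit biased exponent field of FP32

counts = np.bincount(exponents, minlength=256)
p = counts[counts > 0] / counts.sum()
entropy = -(p * np.log2(p)).sum()

print(f"distinct exponent values: {(counts > 0).sum()}")
print(f"empirical exponent entropy: {entropy:.2f} bits (vs. 8 raw bits)")
```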
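
The adaptive-switching idea can be illustrated with a simple cost model; the function and numbers below are hypothetical, not ZipCCL's API. The point is that compression only wins when the codec's cost is smaller than the bytes it saves on the wire.

```python
# Hypothetical heuristic: send compressed only when codec time plus the
# smaller transfer beats sending raw bytes over the link.
def choose_path(num_bytes: int,
                est_ratio: float,   # expected compressed/raw size, e.g. 0.6
                link_gbps: float,   # effective link bandwidth
                codec_gbps: float) -> str:
    """Return 'compressed' or 'raw' for one collective call."""
    raw_time = num_bytes / (link_gbps * 1e9 / 8)
    comp_time = (num_bytes * est_ratio) / (link_gbps * 1e9 / 8)
    codec_time = 2 * num_bytes / (codec_gbps * 1e9 / 8)  # compress + decompress
    return "compressed" if comp_time + codec_time < raw_time else "raw"

# Example: a 512 MiB gradient all-reduce over a 100 Gb/s link with a fast codec.
print(choose_path(512 * 2**20, est_ratio=0.6, link_gbps=100, codec_gbps=800))
```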

Abstract

Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored, since compression and decompression typically incur overheads that outweigh the benefits of reduced communication traffic. We observe that the communication data during training, including activations, gradients, and parameters, often follows a near-Gaussian distribution, a property well suited to data compression. Thus, we introduce ZipCCL, a lossless compressed communication library of collectives for LLM training. ZipCCL is equipped with our novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels with carefully designed memory access patterns and pipelining over a communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35× and achieves end-to-end training speedups of up to 1.18× without any impact on model quality.
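
One plausible reading of "theoretically grounded ... without expensive online statistics" (an assumption on our part, not the paper's exact derivation) is that for near-Gaussian tensors the probability of each FP32 exponent bucket follows in closed form from the Gaussian CDF, so a code table can be derived from the scale sigma alone rather than from an online histogram. A minimal sketch:

```python
import math

# Assumption: closed-form exponent-bucket probabilities for x ~ N(0, sigma),
# illustrating how a static code table could be built without online stats.
def exponent_probs(sigma: float, e_lo: int = -60, e_hi: int = 10):
    """P(2^e <= |x| < 2^(e+1)) for x ~ N(0, sigma), per unbiased exponent e."""
    def cdf_abs(t):  # P(|x| < t) for a zero-mean Gaussian
        return math.erf(t / (sigma * math.sqrt(2.0)))
    return {e: cdf_abs(2.0 ** (e + 1)) - cdf_abs(2.0 ** e)
            for e in range(e_lo, e_hi)}

probs = exponent_probs(sigma=1e-3)
top = sorted(probs.items(), key=lambda kv: -kv[1])[:4]
mass = sum(p for _, p in top)
print("four most likely exponents cover "
      f"{100 * mass:.1f}% of values: {[e for e, _ in top]}")
```

In practice a library would map these probabilities to short codes for the dominant exponents plus an escape code for the tail, which is one way the compression step could avoid per-tensor histogramming.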