ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

arXiv cs.CL / 5/1/2026


Key Points

  • Distributed LLM training is often limited by communication overhead, and the paper argues that lossless compression has been underused because compression and decompression can cost more than the communication savings.
  • The authors observe that training communications (activations, gradients, parameters) are often close to a Gaussian distribution, enabling efficient lossless compression (see the first sketch after this list).
  • They introduce ZipCCL, a lossless compressed communication library for LLM collectives, including exponent coding tailored to near-Gaussian tensors and GPU-optimized compression/decompression kernels.
  • ZipCCL also uses adaptive collective switching, choosing among collective operations at runtime based on workload and system characteristics (see the second sketch below).
  • On a 64-GPU cluster across mixture-of-experts and dense transformer models, ZipCCL cuts communication time by up to 1.35× and improves end-to-end training speed by up to 1.18× without affecting model quality.
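
To make the near-Gaussian observation concrete, here is a minimal, self-contained sketch (not ZipCCL code): the FP32 exponent bits of Gaussian-distributed values concentrate on a handful of values, so their empirical entropy sits far below the 8 raw exponent bits, which is what makes exponent coding effective.

```python
import numpy as np

# Empirically check exponent concentration for near-Gaussian tensor values.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1e-3, size=1_000_000).astype(np.float32)

bits = x.view(np.uint32)
exponents = (bits >> 23) & 0xFF  # the 8-bit biased exponent field of FP32

counts = np.bincount(exponents, minlength=256)
p = counts[counts > 0] / counts.sum()
entropy = -(p * np.log2(p)).sum()

print(f"distinct exponent values: {(counts > 0).sum()}")
print(f"empirical exponent entropy: {entropy:.2f} bits (vs. 8 raw bits)")
```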
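
The adaptive-switching idea can be illustrated with a simple cost model; the function and numbers below are hypothetical, not ZipCCL's API. The point is that compression only wins when the codec's cost is smaller than the bytes it saves on the wire.

```python
# Hypothetical heuristic: send compressed only when codec time plus the
# smaller transfer beats sending raw bytes over the link.
def choose_path(num_bytes: int,
                est_ratio: float,   # expected compressed/raw size, e.g. 0.6
                link_gbps: float,   # effective link bandwidth
                codec_gbps: float) -> str:
    """Return 'compressed' or 'raw' for one collective call."""
    raw_time = num_bytes / (link_gbps * 1e9 / 8)
    comp_time = (num_bytes * est_ratio) / (link_gbps * 1e9 / 8)
    codec_time = 2 * num_bytes / (codec_gbps * 1e9 / 8)  # compress + decompress
    return "compressed" if comp_time + codec_time < raw_time else "raw"

# Example: a 512 MiB gradient all-reduce over a 100 Gb/s link with a fast codec.
print(choose_path(512 * 2**20, est_ratio=0.6, link_gbps=100, codec_gbps=800))
```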

Abstract

Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored, since compression and decompression typically incur overheads that outweigh the benefits of reduced communication traffic. We observe that the communication data during training, including activations, gradients, and parameters, often follows a near-Gaussian distribution, a property well suited to data compression. Thus, we introduce ZipCCL, a lossless compressed communication library of collectives for LLM training. ZipCCL is equipped with our novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels with carefully designed memory access patterns and pipelining over a communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35× and achieves end-to-end training speedups of up to 1.18× without any impact on model quality.
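
One plausible reading of "theoretically grounded ... without expensive online statistics" (an assumption on our part, not the paper's exact derivation) is that for near-Gaussian tensors the probability of each FP32 exponent bucket follows in closed form from the Gaussian CDF, so a code table can be derived from the scale sigma alone rather than from an online histogram. A minimal sketch:

```python
import math

# Assumption: closed-form exponent-bucket probabilities for x ~ N(0, sigma),
# illustrating how a static code table could be built without online stats.
def exponent_probs(sigma: float, e_lo: int = -60, e_hi: int = 10):
    """P(2^e <= |x| < 2^(e+1)) for x ~ N(0, sigma), per unbiased exponent e."""
    def cdf_abs(t):  # P(|x| < t) for a zero-mean Gaussian
        return math.erf(t / (sigma * math.sqrt(2.0)))
    return {e: cdf_abs(2.0 ** (e + 1)) - cdf_abs(2.0 ** e)
            for e in range(e_lo, e_hi)}

probs = exponent_probs(sigma=1e-3)
top = sorted(probs.items(), key=lambda kv: -kv[1])[:4]
mass = sum(p for _, p in top)
print("four most likely exponents cover "
      f"{100 * mass:.1f}% of values: {[e for e, _ in top]}")
```

In practice a library would map these probabilities to short codes for the dominant exponents plus an escape code for the tail, which is one way the compression step could avoid per-tensor histogramming.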