CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

arXiv cs.AI / 5/7/2026


Key Points

  • The paper introduces CCL-D, a high-precision diagnostic system aimed at detecting and localizing slow/hang communication anomalies during large-scale distributed model training.
  • CCL-D combines a rank-level real-time probe, which uses a lightweight distributed tracing framework to measure cross-layer anomaly metrics from communication traffic, with an intelligent decision analyzer.
  • The analyzer automates both anomaly detection and root-cause localization, enabling precise identification of the faulty GPU rank responsible for the slowdown/hang.
  • In a 4,000-GPU cluster deployment over one year, CCL-D delivered near-complete coverage of known slow/hang anomalies and typically pinpointed affected ranks within 6 minutes, outperforming prior approaches.

Abstract

As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
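
The abstract does not spell out which metrics the probe records or how the analyzer decides, so the following is only an illustrative sketch of the general idea of rank-level localization, not CCL-D's actual algorithm. It assumes each rank reports a hypothetical sample containing the index of the last collective it entered (`seq`) and how long it has been waiting inside it (`wait_s`); the names `RankSample` and `suspect_ranks` and all thresholds are invented for this example. The heuristic: in a hang, ranks that never entered the latest collective are suspects; in a slowdown, the straggler arrives last and therefore waits far less than its peers.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RankSample:
    """Hypothetical per-rank probe sample (fields are assumptions)."""
    rank: int
    seq: int        # index of the last collective this rank has entered
    wait_s: float   # seconds spent waiting inside that collective


def suspect_ranks(samples, hang_timeout_s=300.0, slow_wait_s=1.0, slow_factor=3.0):
    """Return (verdict, ranks) where verdict is 'hang', 'slow', or 'ok'."""
    if not samples:
        return "ok", []

    max_seq = max(s.seq for s in samples)

    # Hang: some ranks are stuck waiting in the newest collective while
    # other ranks never entered it. The ranks that are behind are suspects.
    laggards = sorted(s.rank for s in samples if s.seq < max_seq)
    stalled = any(
        s.wait_s >= hang_timeout_s for s in samples if s.seq == max_seq
    )
    if laggards and stalled:
        return "hang", laggards

    # Slow: everyone makes progress, but most ranks wait a long time for a
    # straggler. The straggler arrives last, so its own wait is much shorter.
    typical_wait = median(s.wait_s for s in samples)
    if typical_wait >= slow_wait_s:
        stragglers = sorted(
            s.rank for s in samples if s.wait_s <= typical_wait / slow_factor
        )
        if stragglers:
            return "slow", stragglers

    return "ok", []


# Example: rank 2 never entered collective 10 while the others wait on it.
hang_case = [
    RankSample(0, 10, 400.0),
    RankSample(1, 10, 400.0),
    RankSample(2, 9, 0.0),
    RankSample(3, 10, 400.0),
]
print(suspect_ranks(hang_case))
```

A production system would need far more than this (for example, distinguishing a faulty rank from a faulty link, and handling multiple communicators), which is presumably where the cross-layer metrics described in the abstract come in.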