Decoupled DiLoCo for Resilient Distributed Pre-training

arXiv cs.CL / April 24, 2026

📰 News · Models & Research

Key Points

  • The paper argues that SPMD-based distributed pre-training is fragile because tight accelerator coupling makes the whole run stall when any worker slows down or fails.
  • It introduces Decoupled DiLoCo, which breaks lock-step synchronization by running multiple independent learners that perform local optimization and asynchronously send parameter fragments to a central synchronizer.
  • The synchronizer aggregates updates while bypassing failed or straggling learners using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging.
  • The authors report improved training efficiency in failure-prone environments (tested with millions of simulated chips) with zero global downtime, while retaining competitive performance on text and vision tasks for both dense and mixture-of-experts models.
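The merge rule sketched in the bullets above can be illustrated in a few lines. This is a hypothetical sketch, not the paper's implementation: the `Update` record, the `merge` function, and all parameter names are assumptions. It shows the three mechanisms named in the summary: a minimum quorum, a grace window for late arrivals, and token-weighted averaging that bypasses stragglers.

```python
from dataclasses import dataclass

@dataclass
class Update:
    # Hypothetical record of what one learner sends the synchronizer.
    learner_id: int
    params: list[float]   # one parameter fragment
    tokens: int           # tokens processed since the last sync
    arrival_time: float   # seconds after the round's deadline opened

def merge(updates, min_quorum, grace_window, deadline):
    """Illustrative quorum-gated, token-weighted merge (all names assumed)."""
    # Keep only updates that arrived before the grace window closed;
    # stragglers past the cutoff are simply bypassed, not waited on.
    on_time = [u for u in updates if u.arrival_time <= deadline + grace_window]
    if len(on_time) < min_quorum:
        # Too few learners reported: skip this round rather than stall.
        return None
    # Token-weighted merging: learners that consumed more data
    # contribute proportionally more to the merged fragment.
    total_tokens = sum(u.tokens for u in on_time)
    dim = len(on_time[0].params)
    return [
        sum(u.params[i] * u.tokens for u in on_time) / total_tokens
        for i in range(dim)
    ]
```

For example, with `min_quorum=2`, `grace_window=1.0`, and a learner arriving at `t=5.0`, that straggler is excluded from the weighted average while the two on-time learners are merged; the adaptive sizing of the grace window itself is a detail the paper handles separately.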

Abstract

Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods such as DiLoCo have reduced communication bandwidth, they remain fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent "learners" that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by "chaos engineering", we achieve significantly improved training efficiency in failure-prone environments containing millions of simulated chips, with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-experts architectures.