A new transformer variant has been created to facilitate more efficient model training in distributed settings: 128× activation compression with no significant loss in convergence rates and no significant increase in memory or compute overhead.

Reddit r/LocalLLaMA / 4/17/2026


Key Points

  • Macrocosmos released a paper introducing ResBM (Residual Bottleneck Models), a new transformer architecture aimed at reducing inter-stage communication in low-bandwidth, pipeline-parallel distributed training.
  • ResBM adds a residual encoder-decoder bottleneck across pipeline boundaries while preserving an explicit low-rank identity path to maintain training effectiveness.
  • The paper reports state-of-the-art results showing 128× activation compression with no significant loss in convergence compared with uncompressed baselines.
  • The strongest results in experiments use Muon, and the work is positioned as useful for decentralized or “internet-grade” pipeline parallel training setups.
  • The poster discloses that they work at Macrocosmos and are sharing the paper on behalf of its engineering team, indicating close ties to the authorship and evaluation of the approach.

Macrocosmos has released a paper on ResBM (Residual Bottleneck Models), a new transformer-based architecture designed for low-bandwidth pipeline-parallel training.

https://arxiv.org/abs/2604.11947

ResBM introduces a residual encoder-decoder bottleneck across pipeline boundaries, with the goal of reducing inter-stage communication while preserving an explicit low-rank identity path. The paper reports SOTA 128× activation compression without significant loss in convergence relative to uncompressed baselines.
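To make the idea concrete, here is a minimal numpy sketch of a bottleneck across a pipeline boundary. All names, shapes, and the split between the bottleneck code and the low-rank identity coefficients are illustrative assumptions, not the paper's actual ResBM architecture; it only shows how compressing the transmitted activation while keeping a separate low-rank path can shrink inter-stage traffic by 128×.

```python
import numpy as np

# Illustrative sketch only; dimensions below are assumptions, not from the paper.
d_model = 4096       # hidden size of the activation at the pipeline boundary
d_bottleneck = 24    # width of the encoder-decoder bottleneck code
rank = 8             # rank of the explicit low-rank identity path

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((d_model, d_bottleneck)) / np.sqrt(d_model)
W_dec = rng.standard_normal((d_bottleneck, d_model)) / np.sqrt(d_bottleneck)
# Low-rank identity path: U @ V is trained (here, random) to pass the
# activation through with a cheap rank-r correction alongside the bottleneck.
U = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)
V = rng.standard_normal((rank, d_model)) / np.sqrt(rank)

x = rng.standard_normal((1, d_model))  # activation leaving pipeline stage k

# Stage k transmits only the bottleneck code plus the rank-r coefficients.
code = x @ W_enc           # (1, 24)
id_coeff = x @ U           # (1, 8)

# Stage k+1 reconstructs the activation from both paths.
x_hat = code @ W_dec + id_coeff @ V

sent = code.size + id_coeff.size  # 24 + 8 = 32 floats
print(f"floats sent: {sent} vs {x.size} (ratio {x.size // sent}x)")
```

With these illustrative sizes, each boundary sends 32 floats instead of 4096, a 128× reduction in communicated activation volume, while the rank-r path gives the decoder a direct route back toward the original activation.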

In their experiments, the strongest compressed results use Muon, and the paper positions ResBM as a development in decentralized / internet-grade pipeline parallel training.

Full disclosure: I work at Macrocosmos. Sharing this paper from the engineering team.

submitted by /u/network-kai