Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
arXiv cs.LG / 4/2/2026
Key Points
- The paper reports large-scale pretraining of dense and Mixture-of-Experts (MoE) language models from scratch on the Aurora exascale system using thousands of GPU tiles.
- It introduces “Optimus,” an in-house training library supporting standard large-model techniques and demonstrating pretraining of Mula-1B (dense) and Mula-7B-A1B (MoE) on 3,072 GPUs for 4T tokens.
- The authors scale up MoE training to larger models (Mula-20B-A2B, Mula-100B-A7B, Mula-220B-A10B) and run the largest model up to 100B tokens on the same dataset.
- For Mula-220B-A10B, they increase compute from 384 to 12,288 GPU tiles and report ~90% scaling efficiency, indicating strong throughput gains at extreme parallelism.
- Performance and robustness improvements include custom GPU kernels for expert computation, an EP-aware sharded optimizer with up to 1.71× speedups, and reliability/fault-tolerance features for stable long runs at scale.
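To make the scaling claim concrete, the sketch below works through what "~90% scaling efficiency" from 384 to 12,288 GPU tiles implies. This is an illustrative calculation only, not code from the paper or the Optimus library; the normalized throughput values are hypothetical placeholders.

```python
def scaling_efficiency(base_tiles: int, base_tput: float,
                       scaled_tiles: int, scaled_tput: float) -> float:
    """Scaling efficiency: achieved speedup divided by ideal (linear) speedup."""
    ideal_speedup = scaled_tiles / base_tiles
    achieved_speedup = scaled_tput / base_tput
    return achieved_speedup / ideal_speedup

# Tile counts reported in the paper for Mula-220B-A10B.
base_tiles, scaled_tiles = 384, 12_288

# Hypothetical throughput, normalized so the 384-tile run is 1.0.
base_tput = 1.0
ideal_speedup = scaled_tiles / base_tiles          # 32x if scaling were perfect
scaled_tput = 0.90 * ideal_speedup * base_tput     # ~90% of ideal, as reported

print(scaling_efficiency(base_tiles, base_tput, scaled_tiles, scaled_tput))
# prints 0.9 -- i.e., ~28.8x real speedup against a 32x ideal
```

In other words, a 32x increase in tiles at ~90% efficiency corresponds to roughly a 28.8x throughput gain, which is what "strong throughput gains at extreme parallelism" refers to.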
Related Articles
- Black Hat Asia (AI Business)
- v5.5.0 (Hugging Face Transformers Releases)
- Bonsai (PrismML's 1-bit version of Qwen3 8B/4B/1.7B) was not an April Fools' joke (Reddit r/LocalLLaMA)
- Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption. (Dev.to)
- Inference Engines: a visual deep dive into the layers of an LLM (Dev.to)