Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

arXiv cs.LG / 4/28/2026

Key Points

  • The paper explains that scaling Mixture-of-Experts (MoE) LLM inference is bottlenecked by expert load imbalance and inefficient token routing, which becomes especially costly in multi-node settings due to heavy inter-node all-to-all communication.
  • By profiling leading open-source MoE models (Llama 4 Maverick, DeepSeek V3-671B, Qwen3-230B-A22B) using 100k+ real expert activation traces, the authors identify recurring properties such as shifting domain-specific expert usage and a strong link between prefill and decode expert activations (a minimal profiling sketch follows this list).
  • Based on these activation-pattern findings, they propose workload-aware micro-batch grouping and an expert placement strategy designed to maximize token locality to the target expert.
  • Experiments across models and datasets show that these optimizations can cut all-to-all communication volume by up to 20%, lowering MoE decode latency while improving accelerator utilization.
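
The paper's profiling pipeline is only summarized above; as a rough illustration of the kind of analysis it describes, the sketch below tallies per-expert activation counts from routing traces and reports a simple max-to-mean load-imbalance ratio. The trace format (one list of top-k selected expert IDs per token), the function name, and the imbalance metric are assumptions made for this example, not the authors' actual tooling.

```python
from collections import Counter
from typing import Iterable

def expert_load_stats(traces: Iterable[list[int]], num_experts: int) -> dict:
    """Summarize expert load from routing traces.

    Each trace is assumed to be the list of expert IDs the router selected
    for one token (top-k routing). This schema is illustrative, not the
    paper's actual trace format.
    """
    counts = Counter()
    total = 0
    for selected_experts in traces:
        counts.update(selected_experts)
        total += len(selected_experts)

    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = total / num_experts if num_experts else 0.0
    return {
        "loads": loads,
        # max/mean ratio: 1.0 means perfectly balanced experts; larger values
        # indicate a few "hot" experts receiving most of the traffic.
        "imbalance": max(loads) / mean_load if mean_load else 0.0,
    }

# Example: 8 experts with top-2 routing; expert 3 is disproportionately popular.
example_traces = [[3, 1], [3, 5], [3, 0], [2, 3], [3, 7], [4, 3]]
print(expert_load_stats(example_traces, num_experts=8)["imbalance"])  # -> 4.0
```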

Abstract

Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collect over 100k real expert activation traces. Studying these expert activation patterns, we uncover several persistent properties across these frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy that maximizes token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations reduce all-to-all communication volume by up to 20%, resulting in lower MoE decode latency and better accelerator utilization.
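
The abstract names an expert placement strategy that maximizes token locality to the destination expert but does not spell out the algorithm. As one possible reading, the sketch below greedily assigns experts to nodes using profiled per-node routing counts so that the largest token flows stay local; the data layout, function name, and greedy heuristic are all assumptions for illustration, not the paper's method.

```python
def greedy_expert_placement(route_counts: list[list[int]],
                            experts_per_node: int) -> list[list[int]]:
    """Assign experts to nodes so that as many routed tokens as possible stay local.

    route_counts[n][e] = tokens that node n's workload routes to expert e
    (assumed known from profiling). Greedy heuristic: repeatedly grant the
    unplaced (node, expert) pair with the largest local token count until
    each node holds `experts_per_node` experts. Illustrative only.
    """
    num_nodes = len(route_counts)
    num_experts = len(route_counts[0])
    assert num_nodes * experts_per_node == num_experts

    # All (count, node, expert) candidates, largest local traffic first.
    pairs = sorted(
        ((route_counts[n][e], n, e)
         for n in range(num_nodes) for e in range(num_experts)),
        reverse=True,
    )
    placement = [[] for _ in range(num_nodes)]
    placed = set()
    for _, node, expert in pairs:
        if expert in placed or len(placement[node]) >= experts_per_node:
            continue
        placement[node].append(expert)
        placed.add(expert)
    return placement

# Example: 2 nodes, 4 experts, 2 experts hosted per node.
counts = [[90, 5, 40, 10],   # node 0 mostly routes tokens to experts 0 and 2
          [5, 80, 10, 60]]   # node 1 mostly routes tokens to experts 1 and 3
print(greedy_expert_placement(counts, experts_per_node=2))  # -> [[0, 2], [1, 3]]
```

A greedy pass like this is only a heuristic for the underlying balanced-assignment problem; in practice the placement would also have to respect memory capacity per node and be refreshed as the workload mix (code, math, chat, general) shifts, which is exactly the domain-dependent behavior the paper's traces highlight.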