Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

arXiv cs.LG / 4/28/2026

Key Points

  • The paper explains that scaling Mixture-of-Experts (MoE) LLM inference is bottlenecked by expert load imbalance and inefficient token routing, which becomes especially costly in multi-node settings due to heavy inter-node all-to-all communication.
  • By profiling leading open-source MoE models (Llama 4 Maverick, DeepSeek V3-671B, Qwen3-230B-A22B) using 100k+ real expert activation traces, the authors identify recurring properties such as shifting domain-specific expert usage and a strong link between prefill and decode expert activations (a minimal profiling sketch follows this list).
  • Based on these activation-pattern findings, they propose workload-aware micro-batch grouping and an expert placement strategy designed to maximize token locality to the target expert.
  • Experiments across models and datasets show that these optimizations can cut all-to-all communication volume by up to 20%, lowering MoE decode latency while improving accelerator utilization.
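
The paper's profiling pipeline is only summarized above; as a rough illustration of the kind of analysis it describes, the sketch below tallies per-expert activation counts from routing traces and reports a simple max-to-mean load-imbalance ratio. The trace format (one list of top-k selected expert IDs per token), the function name, and the imbalance metric are assumptions made for this example, not the authors' actual tooling.

```python
from collections import Counter
from typing import Iterable

def expert_load_stats(traces: Iterable[list[int]], num_experts: int) -> dict:
    """Summarize expert load from routing traces.

    Each trace is assumed to be the list of expert IDs the router selected
    for one token (top-k routing). This schema is illustrative, not the
    paper's actual trace format.
    """
    counts = Counter()
    total = 0
    for selected_experts in traces:
        counts.update(selected_experts)
        total += len(selected_experts)

    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = total / num_experts if num_experts else 0.0
    return {
        "loads": loads,
        # max/mean ratio: 1.0 means perfectly balanced experts; larger values
        # indicate a few "hot" experts receiving most of the traffic.
        "imbalance": max(loads) / mean_load if mean_load else 0.0,
    }

# Example: 8 experts with top-2 routing; expert 3 is disproportionately popular.
example_traces = [[3, 1], [3, 5], [3, 0], [2, 3], [3, 7], [4, 3]]
print(expert_load_stats(example_traces, num_experts=8)["imbalance"])  # -> 4.0
```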

Abstract

Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collect over 100k real expert activation traces. Studying these expert activation patterns, we uncover several persistent properties across these frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy that maximizes token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations reduce all-to-all communication volume by up to 20%, resulting in lower MoE decode latency and better accelerator utilization.
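
The abstract names an expert placement strategy that maximizes token locality to the destination expert but does not spell out the algorithm. As one possible reading, the sketch below greedily assigns experts to nodes using profiled per-node routing counts so that the largest token flows stay local; the data layout, function name, and greedy heuristic are all assumptions for illustration, not the paper's method.

```python
def greedy_expert_placement(route_counts: list[list[int]],
                            experts_per_node: int) -> list[list[int]]:
    """Assign experts to nodes so that as many routed tokens as possible stay local.

    route_counts[n][e] = tokens that node n's workload routes to expert e
    (assumed known from profiling). Greedy heuristic: repeatedly grant the
    unplaced (node, expert) pair with the largest local token count until
    each node holds `experts_per_node` experts. Illustrative only.
    """
    num_nodes = len(route_counts)
    num_experts = len(route_counts[0])
    assert num_nodes * experts_per_node == num_experts

    # All (count, node, expert) candidates, largest local traffic first.
    pairs = sorted(
        ((route_counts[n][e], n, e)
         for n in range(num_nodes) for e in range(num_experts)),
        reverse=True,
    )
    placement = [[] for _ in range(num_nodes)]
    placed = set()
    for _, node, expert in pairs:
        if expert in placed or len(placement[node]) >= experts_per_node:
            continue
        placement[node].append(expert)
        placed.add(expert)
    return placement

# Example: 2 nodes, 4 experts, 2 experts hosted per node.
counts = [[90, 5, 40, 10],   # node 0 mostly routes tokens to experts 0 and 2
          [5, 80, 10, 60]]   # node 1 mostly routes tokens to experts 1 and 3
print(greedy_expert_placement(counts, experts_per_node=2))  # -> [[0, 2], [1, 3]]
```

A greedy pass like this is only a heuristic for the underlying balanced-assignment problem; in practice the placement would also have to respect memory capacity per node and be refreshed as the workload mix (code, math, chat, general) shifts, which is exactly the domain-dependent behavior the paper's traces highlight.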