ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

arXiv cs.LG / 5/6/2026


Key Points

  • The paper addresses a key bottleneck in serving MoE-based LLMs for prefill-only workloads (e.g., classification/recommendation/verification), where efficiency is limited by distributed execution overheads rather than raw compute.
  • It argues that much of the overhead comes from coupling expert placement with synchronous activation routing, a design that was carried over from the autoregressive decoding era.
  • ZeRO-Prefill introduces AsyncEP to gather expert weights asynchronously (via weight AllGather) instead of using per-layer activation AllToAll, overlapping communication with long, compute-heavy prefill forward passes.
  • The system also uses prefix-aware routing plus true-FLOPs load tracking with a physically derived saturation threshold to prevent routing imbalance.
  • Experiments on Qwen3-235B-A22B show 1.35–1.37× throughput gains over the best distributed baseline on real workloads and up to 1.59× on long-context synthetic tests, while achieving 29.8–36.2% per-GPU model FLOPs utilization.
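The AsyncEP scheduling idea in the bullets above can be illustrated with a toy double-buffering loop: while layer *i* computes, layer *i+1*'s expert weights are fetched in the background, so the transfer is hidden whenever compute time exceeds transfer time. This is a minimal sketch, not the paper's implementation; the function names, sleep-based latencies, and the use of a thread pool in place of real collective-communication primitives (e.g. a weight AllGather across GPUs) are all illustrative assumptions.

```python
import concurrent.futures
import time

def fetch_expert_weights(layer):
    # Stand-in for the asynchronous weight AllGather: the real system streams
    # expert weights from peer GPUs; here we only simulate a transfer delay.
    time.sleep(0.005)  # hypothetical transfer time
    return f"weights[{layer}]"

def compute_layer(layer, weights):
    # Stand-in for the compute-bound prefill forward pass of one layer,
    # assumed to take longer than the background weight transfer.
    time.sleep(0.01)  # hypothetical compute time
    return f"act[{layer}]<-{weights}"

def prefill_forward(num_layers):
    # Double-buffered schedule: kick off the fetch for layer i+1 before
    # computing layer i, so communication overlaps with computation.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_expert_weights, 0)
        acts = []
        for layer in range(num_layers):
            weights = pending.result()  # blocks only if transfer > compute
            if layer + 1 < num_layers:
                pending = pool.submit(fetch_expert_weights, layer + 1)
            acts.append(compute_layer(layer, weights))
        return acts
```

With these toy latencies, the per-layer fetch finishes while the previous layer is still computing, which is the window the paper argues large-batch prefill opens up.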

Abstract

Production LLM workloads increasingly serve discriminative tasks, such as classification, recommendation, and verification, whose answers are read from the logits of a single prefill pass with no autoregressive decoding. Serving these prefill-only workloads on mixture-of-experts (MoE) models is bottlenecked not by compute but by the distributed execution required to fit the model: existing parallel strategies (tensor, expert, and pipeline parallelism) trade memory pressure for redundant computation, communication, and synchronization, severely degrading MoE prefill serving efficiency. We observe that these overheads stem from coupling expert placement with synchronous activation routing, a design inherited from the decoding era. The long, compute-bound forward passes of large-batch prefill open a per-layer window wide enough to stream expert weights in the background, replacing per-layer activation AllToAll with asynchronous weight AllGather fully overlapped with computation. We propose ZeRO-Prefill, a prefill-only serving system whose backend, AsyncEP (Asynchronous Expert Parallelism), gathers experts by weight rather than routing them by activation, and whose frontend co-enforces a physically derived saturation threshold through prefix-aware routing and true-FLOPs load tracking. On Qwen3-235B-A22B across four hardware/precision configurations, ZeRO-Prefill delivers 1.35–1.37× throughput over the strongest distributed baseline on real-world workloads and up to 1.59× on long-context synthetic workloads, sustaining 29.8–36.2% per-GPU model FLOPs utilization.
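The frontend's "true-FLOPs load tracking with a saturation threshold" can be sketched as greedy least-loaded routing over a per-worker FLOPs counter, where a worker that would cross the threshold defers work instead of absorbing it. Everything here is an illustrative assumption: the toy cost model (linear expert-MLP term plus quadratic attention term, which is why token counts alone understate long-prompt load), the function names, and treating the saturation threshold as a plain parameter rather than the physically derived value the paper describes.

```python
import heapq

def prefill_flops(num_tokens, hidden=4096):
    # Toy cost model (assumption): MLP/expert work grows linearly in tokens,
    # attention quadratically, so equal token counts are not equal work.
    return num_tokens * hidden + num_tokens ** 2

def route(requests, num_workers, saturation):
    # Greedy least-loaded routing on tracked FLOPs. A worker whose load would
    # cross the saturation threshold defers the request rather than taking it.
    heap = [(0, w) for w in range(num_workers)]  # (tracked FLOPs, worker id)
    heapq.heapify(heap)
    placed, deferred = {}, []
    for rid, tokens in requests:
        cost = prefill_flops(tokens)
        load, worker = heap[0]  # least-loaded worker
        if load + cost > saturation:
            deferred.append(rid)  # all workers at or past the threshold
            continue
        heapq.heapreplace(heap, (load + cost, worker))
        placed[rid] = worker
    return placed, deferred
```

In the real system the paper pairs this with prefix-aware routing, so requests sharing a prompt prefix land on the same worker; this sketch omits that dimension for brevity.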