FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

arXiv cs.LG / 4/6/2026


Key Points

  • The paper identifies a key MoE inference bottleneck: most expert weights sit idle in GPU memory while competing with the performance-critical KV cache, whose capacity directly determines serving throughput; this mismatch leaves memory underutilized and degrades serving performance.
  • FluxMoE is proposed as a new MoE inference system that decouples expert parameters from persistent GPU residency using an expert paging abstraction that streams expert weights on demand.
  • By materializing experts only when needed and evicting them immediately after use, FluxMoE prioritizes GPU memory for throughput-critical runtime state like the KV cache.
  • The system is implemented on top of vLLM, targeting efficient MoE inference under severe memory constraints.
  • Experiments report up to 3.0× throughput gains over vLLM in memory-intensive scenarios without sacrificing model fidelity.
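The expert-paging idea above — materialize an expert's weights only when a token is routed to it, then evict them right away so GPU memory stays free for the KV cache — can be sketched as a small cache. This is an illustrative toy (class and method names like `ExpertPager` and `materialize` are my own, not FluxMoE's actual API), using plain NumPy arrays to stand in for host- and device-resident weights.

```python
import numpy as np

class ExpertPager:
    """Toy sketch of expert paging: experts live in host memory and are
    copied into a small 'device' pool on demand, then evicted immediately
    after use. Names are illustrative, not FluxMoE's real interface."""

    def __init__(self, host_experts, device_slots=2):
        self.host_experts = host_experts   # expert_id -> weights (host copy)
        self.device_slots = device_slots   # max experts resident at once
        self.resident = {}                 # expert_id -> "device" copy

    def materialize(self, expert_id):
        # Stream the expert in only if it is not already resident.
        if expert_id not in self.resident:
            if len(self.resident) >= self.device_slots:
                # Evict an arbitrary resident expert to free a slot.
                self.resident.popitem()
            self.resident[expert_id] = self.host_experts[expert_id].copy()
        return self.resident[expert_id]

    def evict(self, expert_id):
        # Release the "device" copy immediately after the expert is used.
        self.resident.pop(expert_id, None)

    def run_expert(self, expert_id, x):
        w = self.materialize(expert_id)
        y = x @ w                  # the expert's forward pass
        self.evict(expert_id)      # expert weights are transient, not persistent
        return y

# Toy usage: four experts, at most one resident at a time.
experts = {i: np.eye(4) * (i + 1) for i in range(4)}
pager = ExpertPager(experts, device_slots=1)
out = pager.run_expert(2, np.ones(4))
```

In a real system the host-to-device copy would be an asynchronous transfer overlapped with computation, and eviction policy matters; the sketch only shows the residency contract: no expert stays on the device between uses.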

Abstract

Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0× throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.
