SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

arXiv cs.AI / 4/14/2026


Key Points

  • SpecMoE is presented as a memory-efficient Mixture-of-Experts (MoE) inference system that targets the deployment challenges of high memory usage and sub-optimal parameter efficiency in LLMs.
  • The approach leverages a self-assisted speculative decoding algorithm to improve MoE inference throughput by up to 4.30× without requiring any additional model training or fine-tuning.
  • The work positions speculative decoding as applicable to MoE inference in ways that overcome limitations of existing CPU-offloaded MoE systems, especially under large batch sizes.
  • It reports significant reductions in bandwidth demands for both memory and interconnects, aiming to improve performance on memory-constrained systems.

Abstract

The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to 4.30×, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
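The abstract does not spell out the decoding algorithm, but the general speculative-decoding loop it builds on is well known: a cheap draft pass proposes several tokens, and the expensive model verifies them in one pass, accepting the longest matching prefix. The sketch below is a generic greedy version of that loop, not SpecMoE's actual algorithm; `draft_model` and `target_model` are hypothetical stand-ins (here toy next-token functions) for the self-assisted draft pass and the full MoE forward pass.

```python
# Toy sketch of greedy speculative decoding (NOT SpecMoE's actual algorithm).
# A model here is any function mapping a token context to the next token.

def speculative_decode(draft_model, target_model, context, num_tokens, k=4):
    """Generate num_tokens tokens after `context`, drafting k tokens per round."""
    out = list(context)
    while len(out) - len(context) < num_tokens:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the target scores all k drafted positions
        #    (in a real system, as a single batched forward pass).
        for i in range(k):
            expected = target_model(out + draft[:i])
            if draft[i] == expected:
                out.append(draft[i])
            else:
                # First mismatch: take the target's token and end the round.
                out.append(expected)
                break
        else:
            # All k drafts accepted; the target's next token comes for free.
            out.append(target_model(out))
    # Trim any overshoot from the final round.
    return out[len(context):len(context) + num_tokens]


# Toy stand-in models: the target counts mod 10; the draft mimics it
# but errs after seeing a 5, forcing a rejection in that round.
def target_model(ctx):
    return (ctx[-1] + 1) % 10

def draft_model(ctx):
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10

print(speculative_decode(draft_model, target_model, [0], 8))
# → [1, 2, 3, 4, 5, 6, 7, 8]
```

Every accepted draft token saves one sequential pass of the expensive model, which is where the throughput gain comes from; a "self-assisted" draft (e.g., a cheaper pass through the same model) avoids training a separate draft network, consistent with the paper's no-fine-tuning claim.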