Speculating Experts Accelerates Inference for Mixture-of-Experts

arXiv cs.AI / 3/23/2026


Key Points

  • The authors propose an expert prefetching scheme for mixture-of-experts models that uses currently computed internal representations to speculate which experts will be needed next, enabling memory transfers to overlap with computation.
  • They demonstrate that future experts can be reliably predicted across multiple MoE architectures, preserving downstream task accuracy while improving compute-memory overlap.
  • Integrated into an optimized inference engine, the method yields up to a 14% reduction in time per output token (TPOT) compared with on-demand loading from CPU memory.
  • When speculative execution risks accuracy, they explore lightweight estimators to improve expert-prediction hit rates and minimize performance degradation.
  • The work is open-sourced, with code released at https://github.com/axonn-ai/yalis/tree/offload_prefetch, facilitating adoption and integration.
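The core idea in the bullets above is that a lightweight predictor, reading the current layer's hidden state, can guess which experts the router will select in a later layer, so their weights can be fetched early. The paper does not publish its predictor architecture here, so the sketch below is only an illustrative stand-in: it models the "true" router and a noisy linear predictor with hypothetical random weights, and measures the top-k expert-prediction hit rate that determines how often a prefetch avoids a stall.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2  # hypothetical sizes, for illustration

# Stand-in for the next layer's true router (not the paper's actual weights).
router_w = rng.standard_normal((d_model, n_experts))
# Lightweight speculative predictor: modeled here as a noisy copy of the
# router, mimicking an estimator trained to approximate its decisions.
predictor_w = router_w + 0.1 * rng.standard_normal((d_model, n_experts))

def top_k_experts(h, w, k=top_k):
    """Return the set of k highest-scoring expert indices for hidden state h."""
    return set(np.argsort(h @ w)[-k:])

hits, total = 0, 0
for _ in range(1000):
    h = rng.standard_normal(d_model)           # current hidden representation
    speculated = top_k_experts(h, predictor_w) # experts to prefetch early
    actual = top_k_experts(h, router_w)        # experts the router truly picks
    hits += len(speculated & actual)
    total += top_k

hit_rate = hits / total
print(f"expert-prediction hit rate: {hit_rate:.2f}")
```

A high hit rate means most speculated experts are already resident in GPU memory when the router fires, so the transfer cost is hidden behind compute; misses fall back to on-demand loading.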

Abstract

Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU memory, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute-memory overlap by eliminating the need to re-fetch true router-selected experts. Integrated into an optimized inference engine, our approach achieves up to a 14% reduction in time per output token (TPOT) over on-demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert-prediction hit rates, thereby reducing performance degradation. Our code is open-sourced at https://github.com/axonn-ai/yalis/tree/offload_prefetch.
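The TPOT gain comes from restructuring the decode step so that speculative weight transfers run concurrently with the layer's computation, instead of serially after it. The real engine would use asynchronous CUDA copies on a side stream; the minimal sketch below substitutes timed sleeps and Python threads to make the scheduling difference observable without a GPU. All function names and timings are hypothetical.

```python
import threading
import time

def fetch_expert(eid, cache):
    # Simulated CPU->GPU expert-weight transfer (stand-in for an async copy).
    time.sleep(0.05)
    cache[eid] = f"weights[{eid}]"

def compute_layer():
    # Simulated attention + dense computation for the current layer.
    time.sleep(0.05)

def decode_step_on_demand(experts):
    """Baseline: experts are fetched only after the router fires, serially."""
    cache = {}
    compute_layer()
    for e in experts:
        fetch_expert(e, cache)
    return cache

def decode_step_prefetch(experts):
    """Speculative prefetch: transfers are issued early and overlap compute."""
    cache = {}
    threads = [threading.Thread(target=fetch_expert, args=(e, cache))
               for e in experts]  # speculated experts, fetched in background
    for t in threads:
        t.start()
    compute_layer()               # transfer time is hidden behind this work
    for t in threads:
        t.join()
    return cache

t0 = time.perf_counter(); decode_step_on_demand([0, 1]); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); decode_step_prefetch([0, 1]);  t_ovl = time.perf_counter() - t0
print(f"on-demand: {t_seq:.2f}s  prefetch+overlap: {t_ovl:.2f}s")
```

With correct speculation, the overlapped step finishes in roughly the longer of the two phases rather than their sum, which is the mechanism behind the reported per-token latency reduction.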