MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
arXiv cs.AI / 3/12/2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- MoE-SpAc addresses the memory constraints of MoE inference on edge devices by repurposing Speculative Decoding (SD) as a memory-aware lookahead mechanism.
- It introduces a Speculative Utility Estimator that forecasts expert demand and guides memory allocation and eviction decisions (see the first sketch after this list).
- It employs a Heterogeneous Workload Balancer that partitions computation across heterogeneous devices via online integer optimization, plus an Asynchronous Execution Engine that coordinates prefetching and eviction in the same utility space (see the second and third sketches after this list).
- Experimental results show a 42% throughput (TPS) improvement over the state-of-the-art SD-based baseline and an average 4.04x speedup over standard baselines, with code available on GitHub.
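
To make the utility-estimation idea concrete, here is a minimal sketch (not the paper's implementation) of how a speculative utility score could work: router probabilities collected while routing the SD draft tokens are weighted by each token's chance of surviving verification, and the resulting per-expert utilities drive prefetch/evict decisions. All names, shapes, and the acceptance-weighting heuristic are illustrative assumptions.

```python
import numpy as np

def speculative_expert_utility(router_probs: np.ndarray,
                               accept_probs: np.ndarray) -> np.ndarray:
    """Estimate per-expert utility over a speculative lookahead window.

    router_probs: (num_draft_tokens, num_experts) gate probabilities
                  produced while routing the draft tokens.
    accept_probs: (num_draft_tokens,) probability that each draft token
                  survives verification; later tokens are less certain.
    Returns a (num_experts,) score: the expected number of activations
    of each expert within the lookahead window.
    """
    # Weight each draft token's routing distribution by the chance it is
    # actually accepted, then sum over the lookahead window.
    return (router_probs * accept_probs[:, None]).sum(axis=0)

def plan_memory(utility: np.ndarray, resident: set, capacity: int):
    """Pick experts to prefetch and to evict, ranked by estimated utility."""
    want = set(np.argsort(utility)[::-1][:capacity].tolist())  # top-k experts
    prefetch = sorted(want - resident)   # high utility, not yet on device
    evict = sorted(resident - want)      # resident but low utility
    return prefetch, evict

# Toy example: 4 draft tokens, 8 experts, room for 3 experts on the GPU.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=4)   # stand-in for router outputs
accept = np.array([0.95, 0.85, 0.7, 0.5])   # decaying acceptance confidence
u = speculative_expert_utility(probs, accept)
print(plan_memory(u, resident={0, 1, 2}, capacity=3))
```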
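
The Heterogeneous Workload Balancer is described as solving an online integer optimization. As a stand-in for a real solver, the sketch below brute-forces a tiny integer assignment of experts to devices that minimizes the makespan; the cost model (per-expert load divided by device speed) and all names are assumptions for illustration, not MoE-SpAc's formulation.

```python
from itertools import product

def partition_experts(loads, device_speeds):
    """Exhaustively solve a small integer assignment: map each expert to a
    device so that the slowest device's finish time (makespan) is minimized.

    loads:         per-expert compute cost, e.g. expected activation counts
                   taken from the speculative utility estimate.
    device_speeds: relative throughput of each heterogeneous device.
    """
    best_assign, best_makespan = None, float("inf")
    for assign in product(range(len(device_speeds)), repeat=len(loads)):
        finish = [0.0] * len(device_speeds)
        for expert, dev in enumerate(assign):
            finish[dev] += loads[expert] / device_speeds[dev]
        if max(finish) < best_makespan:
            best_assign, best_makespan = assign, max(finish)
    return best_assign, best_makespan

# Toy example: 6 experts, a fast GPU (speed 4) and a slow CPU (speed 1).
loads = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5]
print(partition_experts(loads, device_speeds=[4.0, 1.0]))
```

An exhaustive search is only viable for a handful of experts; an online system would instead re-solve an incremental integer program (or use a greedy heuristic) as utility estimates change.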
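
Finally, a rough sketch of the prefetch/compute overlap that an asynchronous execution engine implies: while the current layer computes, a background worker fetches the experts that the utility estimate predicts the next layer will need, and stale experts are evicted afterwards. The per-layer plan, placeholder functions, and thread-pool mechanics are all assumptions, not the paper's actual engine.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer(layer_id, active_experts):
    """Placeholder for MoE layer compute on the accelerator."""
    print(f"layer {layer_id}: computing with experts {sorted(active_experts)}")

def prefetch(expert_id):
    """Placeholder for copying an expert's weights host -> device."""
    print(f"prefetching expert {expert_id}")
    return expert_id

# Hypothetical plan: layer -> experts the utility estimator predicts it needs.
plan = {0: {1, 3}, 1: {2, 5}, 2: {1, 4}}
resident = set(plan[0])

with ThreadPoolExecutor(max_workers=1) as io:
    for layer in range(3):
        nxt = plan.get(layer + 1, set())
        # Kick off prefetches for the next layer's predicted experts...
        futures = [io.submit(prefetch, e) for e in nxt - resident]
        run_layer(layer, plan[layer] & resident)   # ...while computing now.
        resident |= {f.result() for f in futures}  # join the prefetches
        resident -= plan[layer] - nxt              # evict stale experts
```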
Related Articles
- We Scanned 11,529 MCP Servers for EU AI Act Compliance (Dev.to)
- Still paying 4 years for a tech career (Dev.to)
- Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both (THE DECODER)
- [P] Inferencing Llama3.2-1B-Instruct on 3xMac Minis M4 with Data Parallelism using allToall architecture! | smolcluster (Reddit r/MachineLearning)
- Nvidia V100 32 GB getting 115 t/s on Qwen Coder 30B A3B Q5 (Reddit r/LocalLLaMA)