MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
arXiv cs.AI / 3/12/2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- MoE-SpAc addresses the memory constraints of MoE inference on edge devices by repurposing Speculative Decoding (SD) as a memory-aware lookahead mechanism.
- It introduces a Speculative Utility Estimator that forecasts expert demand and guides memory allocation and eviction decisions (see the first sketch after this list).
- It employs a Heterogeneous Workload Balancer that partitions computation across heterogeneous devices via online integer optimization, plus an Asynchronous Execution Engine that coordinates prefetching and eviction in the same utility space (see the second and third sketches after this list).
- Experimental results show a 42% throughput (TPS) improvement over the state-of-the-art SD-based baseline and an average 4.04x speedup over standard baselines, with code available on GitHub.
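
To make the utility-estimation idea concrete, here is a minimal sketch (not the paper's implementation) of how a speculative utility score could work: router probabilities collected while routing the SD draft tokens are weighted by each token's chance of surviving verification, and the resulting per-expert utilities drive prefetch/evict decisions. All names, shapes, and the acceptance-weighting heuristic are illustrative assumptions.

```python
import numpy as np

def speculative_expert_utility(router_probs: np.ndarray,
                               accept_probs: np.ndarray) -> np.ndarray:
    """Estimate per-expert utility over a speculative lookahead window.

    router_probs: (num_draft_tokens, num_experts) gate probabilities
                  produced while routing the draft tokens.
    accept_probs: (num_draft_tokens,) probability that each draft token
                  survives verification; later tokens are less certain.
    Returns a (num_experts,) score: the expected number of activations
    of each expert within the lookahead window.
    """
    # Weight each draft token's routing distribution by the chance it is
    # actually accepted, then sum over the lookahead window.
    return (router_probs * accept_probs[:, None]).sum(axis=0)

def plan_memory(utility: np.ndarray, resident: set, capacity: int):
    """Pick experts to prefetch and to evict, ranked by estimated utility."""
    want = set(np.argsort(utility)[::-1][:capacity].tolist())  # top-k experts
    prefetch = sorted(want - resident)   # high utility, not yet on device
    evict = sorted(resident - want)      # resident but low utility
    return prefetch, evict

# Toy example: 4 draft tokens, 8 experts, room for 3 experts on the GPU.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=4)   # stand-in for router outputs
accept = np.array([0.95, 0.85, 0.7, 0.5])   # decaying acceptance confidence
u = speculative_expert_utility(probs, accept)
print(plan_memory(u, resident={0, 1, 2}, capacity=3))
```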
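
The Heterogeneous Workload Balancer is described as solving an online integer optimization. As a stand-in for a real solver, the sketch below brute-forces a tiny integer assignment of experts to devices that minimizes the makespan; the cost model (per-expert load divided by device speed) and all names are assumptions for illustration, not MoE-SpAc's formulation.

```python
from itertools import product

def partition_experts(loads, device_speeds):
    """Exhaustively solve a small integer assignment: map each expert to a
    device so that the slowest device's finish time (makespan) is minimized.

    loads:         per-expert compute cost, e.g. expected activation counts
                   taken from the speculative utility estimate.
    device_speeds: relative throughput of each heterogeneous device.
    """
    best_assign, best_makespan = None, float("inf")
    for assign in product(range(len(device_speeds)), repeat=len(loads)):
        finish = [0.0] * len(device_speeds)
        for expert, dev in enumerate(assign):
            finish[dev] += loads[expert] / device_speeds[dev]
        if max(finish) < best_makespan:
            best_assign, best_makespan = assign, max(finish)
    return best_assign, best_makespan

# Toy example: 6 experts, a fast GPU (speed 4) and a slow CPU (speed 1).
loads = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5]
print(partition_experts(loads, device_speeds=[4.0, 1.0]))
```

An exhaustive search is only viable for a handful of experts; an online system would instead re-solve an incremental integer program (or use a greedy heuristic) as utility estimates change.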
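
Finally, a rough sketch of the prefetch/compute overlap that an asynchronous execution engine implies: while the current layer computes, a background worker fetches the experts that the utility estimate predicts the next layer will need, and stale experts are evicted afterwards. The per-layer plan, placeholder functions, and thread-pool mechanics are all assumptions, not the paper's actual engine.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer(layer_id, active_experts):
    """Placeholder for MoE layer compute on the accelerator."""
    print(f"layer {layer_id}: computing with experts {sorted(active_experts)}")

def prefetch(expert_id):
    """Placeholder for copying an expert's weights host -> device."""
    print(f"prefetching expert {expert_id}")
    return expert_id

# Hypothetical plan: layer -> experts the utility estimator predicts it needs.
plan = {0: {1, 3}, 1: {2, 5}, 2: {1, 4}}
resident = set(plan[0])

with ThreadPoolExecutor(max_workers=1) as io:
    for layer in range(3):
        nxt = plan.get(layer + 1, set())
        # Kick off prefetches for the next layer's predicted experts...
        futures = [io.submit(prefetch, e) for e in nxt - resident]
        run_layer(layer, plan[layer] & resident)   # ...while computing now.
        resident |= {f.result() for f in futures}  # join the prefetches
        resident -= plan[layer] - nxt              # evict stale experts
```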
Related Articles
- We Scanned 11,529 MCP Servers for EU AI Act Compliance (Dev.to)
- Still paying 4 years for a tech career (Dev.to)
- Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both (THE DECODER)
- [P] Inferencing Llama3.2-1B-Instruct on 3xMac Minis M4 with Data Parallelism using allToall architecture! | smolcluster (Reddit r/MachineLearning)
- Nvidia V100 32 GB getting 115 t/s on Qwen Coder 30B A3B Q5 (Reddit r/LocalLLaMA)