MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
arXiv cs.AI / 3/12/2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- MoE-SpAc addresses the memory constraints of MoE inference on edge devices by repurposing Speculative Decoding as a memory-aware lookahead mechanism.
- It introduces a Speculative Utility Estimator that forecasts near-future expert demand and guides memory allocation and eviction decisions (a minimal sketch follows this list).
- It employs a Heterogeneous Workload Balancer, which partitions computation across devices via online integer optimization (a second sketch below illustrates the idea), and an Asynchronous Execution Engine, which synchronizes prefetching and eviction in the same utility space.
- Experimental results show a 42% throughput (tokens per second) improvement over the state-of-the-art speculative-decoding-based baseline and an average 4.04x speedup over standard baselines; code is available on GitHub.
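
How a shared utility space for prefetching and eviction might work can be pictured with a minimal Python sketch. All names here (`SpeculativeUtilityEstimator`, `ExpertCache`, the decay weighting) are hypothetical illustrations under the assumption that draft-token routings from speculative decoding are used to score experts; the paper's actual estimator and cache policy are not reproduced here.

```python
# Hypothetical sketch of speculative-utility-driven expert caching.
# Names and scoring are illustrative, not the paper's actual API.
from collections import Counter


class SpeculativeUtilityEstimator:
    """Scores each expert by how often the draft model's lookahead
    tokens are expected to route to it, with nearer tokens weighted
    more heavily."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.utility = Counter()

    def update(self, draft_routings: list[list[int]]) -> None:
        # draft_routings[t] = expert ids the router selects for draft token t.
        for t, experts in enumerate(draft_routings):
            for e in experts:
                self.utility[e] += self.decay ** t


class ExpertCache:
    """Device-resident expert cache; prefetch and eviction decisions
    are both made from the same utility ranking."""

    def __init__(self, capacity: int, estimator: SpeculativeUtilityEstimator):
        self.capacity = capacity
        self.estimator = estimator
        self.resident: set[int] = set()

    def plan(self, candidates: set[int]) -> tuple[set[int], set[int]]:
        # Rank resident + candidate experts by speculative utility,
        # keep the top-`capacity`, and evict the rest.
        ranked = sorted(self.resident | candidates,
                        key=lambda e: self.estimator.utility[e],
                        reverse=True)
        keep = set(ranked[:self.capacity])
        prefetch = keep - self.resident
        evict = self.resident - keep
        self.resident = keep
        return prefetch, evict
```

The design point mirrored here is that prefetching and eviction consult a single utility score, so the two decisions can never disagree about which experts matter for the upcoming tokens.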
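The workload-partitioning step can likewise be sketched as a small integer assignment problem. This brute-force toy assumes per-expert, per-device latency estimates are available and minimizes the slowest device's load; a real online balancer would use an incremental ILP or heuristic solver rather than enumerating every assignment.

```python
# Toy integer partitioning of experts across heterogeneous devices.
# A sketch only: exponential enumeration is feasible for tiny expert
# counts, unlike the paper's online optimization.
import itertools


def partition_experts(expert_ids, device_cost, num_devices=2):
    """Assign each expert to one device so that the maximum per-device
    total cost (makespan) is minimized.
    device_cost[d][e] = estimated latency of running expert e on device d.
    """
    best, best_cost = None, float("inf")
    for assign in itertools.product(range(num_devices), repeat=len(expert_ids)):
        loads = [0.0] * num_devices
        for e, d in zip(expert_ids, assign):
            loads[d] += device_cost[d][e]
        makespan = max(loads)
        if makespan < best_cost:
            best, best_cost = assign, makespan
    return dict(zip(expert_ids, best)), best_cost


# Example: 4 experts, a fast GPU vs. a slower CPU that offloads memory.
gpu = {0: 1.0, 1: 1.2, 2: 0.9, 3: 1.1}
cpu = {e: 4 * c for e, c in gpu.items()}
assignment, makespan = partition_experts([0, 1, 2, 3], [gpu, cpu])
```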