Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
arXiv cs.LG · April 22, 2026
Key Points
- The paper explains why MoE LLM inference is difficult on Apple Silicon NPUs, citing unpredictable expert routing, NPU-unfriendly irregular operators, and high overhead from launching many small expert kernels.
- It introduces NPUMoE, a runtime engine that offloads dense, static parts of MoE inference to the NPU while keeping CPU/GPU fallbacks for dynamic operations.
- NPUMoE relies on offline calibration to predict expert capacity and popularity, enabling static expert tiers, grouped expert execution to respect NPU concurrency limits, and load-aware compute-graph residency to cut CPU–NPU synchronization overhead.
- Experiments on Apple M-series devices with three MoE LLMs and four long-context workloads show consistent improvements: 1.32x–5.55x lower latency, 1.81x–7.37x better energy efficiency, and 1.78x–5.54x reduced CPU-cycle usage.
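The core scheduling idea in the bullets above, static expert tiers from offline popularity statistics plus grouped execution under an NPU concurrency limit, can be sketched as follows. This is an illustrative reconstruction, not the paper's actual API: the function names, the `hot_fraction` threshold, and the `max_concurrent` limit are all assumptions for the sketch.

```python
# Hypothetical sketch of NPUMoE-style expert tiering and grouping.
# Assumes an offline calibration pass has produced per-expert
# popularity counts; all names and parameters are illustrative.

def build_expert_tiers(popularity, hot_fraction=0.25):
    """Split experts into a 'hot' tier (kept resident on the NPU) and a
    'cold' tier (served by the CPU/GPU fallback path), using offline
    popularity statistics from calibration."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    n_hot = max(1, int(len(ranked) * hot_fraction))
    return set(ranked[:n_hot]), set(ranked[n_hot:])

def group_for_npu(active_experts, hot_tier, max_concurrent=4):
    """Batch the hot experts routed to in this step into groups that
    respect the NPU's concurrency limit; cold experts are returned
    separately for the CPU/GPU fallback path."""
    hot = [e for e in active_experts if e in hot_tier]
    cold = [e for e in active_experts if e not in hot_tier]
    groups = [hot[i:i + max_concurrent]
              for i in range(0, len(hot), max_concurrent)]
    return groups, cold
```

For example, with eight experts whose calibrated popularity decreases from `e0` to `e7` and a 25% hot fraction, `e0` and `e1` land in the NPU-resident tier; a routing step that activates `e0`, `e1`, `e3`, and `e5` would dispatch one NPU group of two experts and send the other two down the fallback path. The real system additionally keeps compute graphs resident across steps to avoid the CPU–NPU synchronization cost this sketch ignores.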