Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

arXiv cs.LG / 4/22/2026


Key Points

  • The paper explains why MoE LLM inference is difficult on Apple Silicon NPUs, citing unpredictable expert routing, NPU-unfriendly irregular operators, and high overhead from launching many small expert kernels.
  • It introduces NPUMoE, a runtime engine that offloads dense, static parts of MoE inference to the NPU while keeping CPU/GPU fallbacks for dynamic operations.
  • NPUMoE relies on offline calibration to predict expert capacity and popularity, enabling static expert tiers, grouped expert execution to respect NPU concurrency limits, and load-aware compute-graph residency to cut CPU–NPU synchronization overhead.
  • Experiments on Apple M-series devices with three MoE LLMs and four long-context workloads show consistent improvements: 1.32x–5.55x lower latency, 1.81x–7.37x better energy efficiency, and 1.78x–5.54x reduced CPU-cycle usage.
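The calibration-driven "static expert tiers" idea from the key points can be sketched as follows. This is an illustrative guess at the mechanism, not the paper's implementation: the tier sizes, the power-of-two choice, and the `assign_tier` helper are all assumptions made for this example.

```python
# Hypothetical sketch: the paper's exact calibration procedure is not given here.
# Idea: measure each expert's peak token load offline, then round it up to the
# nearest fixed "capacity tier" so each tier's NPU compute graph can be compiled
# once with a static tensor shape, sidestepping dynamic expert routing.

TIERS = [64, 128, 256, 512, 1024]  # assumed static capacity tiers (tokens)

def assign_tier(observed_peak_tokens: int) -> int:
    """Pick the smallest tier that covers the expert's calibrated peak load."""
    for tier in TIERS:
        if observed_peak_tokens <= tier:
            return tier
    # Overflow beyond the largest tier would take the dynamic CPU/GPU
    # fallback path in the engine; here we just clamp to the top tier.
    return TIERS[-1]

# Example calibration data: expert id -> peak tokens routed per batch.
calibration = {0: 90, 1: 700, 2: 30, 3: 1500}
tiers = {e: assign_tier(n) for e, n in calibration.items()}
# Popular experts land in large tiers; rarely-routed experts share small ones.
```

The payoff of bucketing capacities this way is that the NPU only ever sees a handful of fixed shapes, which fits the shape-specific compilation constraint the paper cites.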

Abstract

Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, e.g., top-k and scatter/gather, are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to the NPU while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, which drives three key techniques: (1) static tiers for expert capacity to address dynamic expert routing; (2) grouped expert execution to mitigate NPU concurrency limits; and (3) load-aware expert compute-graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.
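The second technique, grouped expert execution, can be illustrated with a minimal sketch. The group size and the `group_experts` helper are assumptions for illustration; the abstract only states that grouping mitigates NPU concurrency limits and kernel-launch overhead.

```python
# Hypothetical sketch of grouped expert execution: instead of dispatching one
# small kernel per routed expert (high launch/synchronization overhead),
# experts are dispatched in fixed-size groups sized to an assumed NPU
# concurrency limit, so each round saturates the NPU with one submission.

CONCURRENCY_LIMIT = 4  # assumed number of expert graphs the NPU runs at once

def group_experts(active_experts: list[int]) -> list[list[int]]:
    """Split the routed experts into dispatch groups of at most CONCURRENCY_LIMIT."""
    return [active_experts[i:i + CONCURRENCY_LIMIT]
            for i in range(0, len(active_experts), CONCURRENCY_LIMIT)]

# 10 routed experts -> 3 dispatch rounds instead of 10 kernel launches.
groups = group_experts(list(range(10)))
```

Combined with load-aware graph residency (keeping popular experts' compiled graphs resident on the NPU), this would cut both the number of dispatches and the CPU-NPU round trips per MoE layer.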