DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge
arXiv cs.LG / 3/20/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper introduces DyMoE, a dynamic mixed-precision quantization framework designed to cut the memory footprint and I/O overhead of Mixture-of-Experts (MoE) models so they can run in real time on edge devices.
- It leverages importance-aware prioritization to quantize experts at runtime, exploiting the skewed distribution of expert importance and the depth-dependent sensitivity of individual layers.
- It employs depth-adaptive scheduling to preserve semantic integrity in the most sensitive layers, and look-ahead prefetching to overlap expert I/O with computation (a hypothetical sketch of both mechanisms follows this list).
- Experimental results on commercial edge hardware show Time-to-First-Token (TTFT) reductions of 3.44x to 22.7x and Time-Per-Output-Token (TPOT) speedups of up to 14.58x, enabling real-time, accuracy-preserving MoE inference on constrained devices.
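
The skewed-importance and depth-sensitivity ideas above lend themselves to a simple scheduling rule. Below is a minimal Python sketch of how per-expert bit-width assignment and look-ahead prefetching could be combined; every name, threshold, and the I/O stub (`assign_bitwidth`, `IMPORTANCE_SPLIT`, `CRITICAL_DEPTH`, `load_expert`, `prefetch_next_layer`) is an illustrative assumption, not DyMoE's published implementation.

```python
"""Hypothetical sketch of importance-aware mixed-precision expert scheduling.

Loosely modeled on the mechanisms summarized above; all constants and
function names are assumptions for illustration only.
"""
from concurrent.futures import ThreadPoolExecutor

CRITICAL_DEPTH = 4        # assumed: the first N layers are quantization-sensitive
IMPORTANCE_SPLIT = 0.2    # assumed: gate-score threshold separating "hot" experts


def assign_bitwidth(gate_score: float, layer_idx: int) -> int:
    """Pick a per-expert bit-width from routing importance and layer depth.

    Depth-adaptive rule (assumed): early layers keep higher precision to
    preserve semantic integrity; deeper layers tolerate aggressive quantization.
    """
    if layer_idx < CRITICAL_DEPTH:
        return 8 if gate_score >= IMPORTANCE_SPLIT else 4
    return 4 if gate_score >= IMPORTANCE_SPLIT else 2


_executor = ThreadPoolExecutor(max_workers=2)


def load_expert(layer_idx: int, expert_idx: int, bits: int) -> bytes:
    """Placeholder for reading quantized expert weights from flash/disk."""
    return b""  # real code would mmap or read the weight shard here


def prefetch_next_layer(next_gate_scores: dict[int, float], next_layer_idx: int):
    """Look-ahead prefetch: start I/O for the experts the router is likely
    to select in the next layer, so loading overlaps current-layer compute."""
    futures = {}
    for expert_idx, score in next_gate_scores.items():
        bits = assign_bitwidth(score, next_layer_idx)
        futures[expert_idx] = _executor.submit(
            load_expert, next_layer_idx, expert_idx, bits
        )
    return futures  # joined when the next layer's experts are dispatched


# Example: skewed routing scores yield different precisions by depth
print(assign_bitwidth(0.7, layer_idx=0))   # hot expert, critical layer -> 8
print(assign_bitwidth(0.05, layer_idx=9))  # cold expert, deep layer    -> 2
```

In this sketch the prefetch futures would be resolved at the start of the next layer's expert dispatch, so flash reads hide behind the current layer's computation rather than stalling it.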
Related Articles
We Scanned 11,529 MCP Servers for EU AI Act Compliance
Dev.to
Still paying 4 years for a tech career
Dev.to
Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both
THE DECODER
[P] Inferencing Llama3.2-1B-Instruct on 3xMac Minis M4 with Data Parallelism using allToall architecture! | smolcluster
Reddit r/MachineLearning
Nvidia V100 32 Gb getting 115 t/s on Qwen Coder 30B A3B Q5
Reddit r/LocalLLaMA