ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

arXiv cs.LG / 4/17/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • ELMoE-3D targets a core bottleneck in on-premises MoE serving: batching turns sparse per-token MoE computation into dense expert activation, making decoding memory-bound.
  • The work combines cache-based acceleration with speculative decoding in a hybrid HW-SW co-designed framework built on high-bandwidth hybrid-bonding hardware.
  • It introduces Elastic Self-Speculative Decoding (Elastic-SD), which jointly scales two intrinsic elasticity axes of MoE (expert and bit) so the reduced model acts both as an expert cache and as a strongly aligned self-draft model for verification.
  • A custom LSB-augmented bit-sliced architecture enables bit-nested execution by exploiting redundancy in bit-slice representations.
  • On its 3D-stacked hardware, ELMoE-3D shows average 6.6× speedups and 4.4× energy-efficiency improvements over naive xPU MoE serving (batch sizes 1–16), and 2.2× speedups with 1.4× energy-efficiency gains over the best prior accelerator baseline.
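The draft-then-verify loop behind the bullets above can be sketched generically. The function names (`target_step`, `draft_step`), the greedy acceptance rule, and the parameters are illustrative assumptions for a self-speculative setup where a reduced sub-model drafts for the full model; this is not the paper's exact Elastic-SD algorithm:

```python
# Hedged sketch of self-speculative decoding (illustrative, not the paper's
# exact method): a reduced sub-model (e.g. fewer experts / lower bit-width)
# drafts k tokens cheaply, and the full model verifies them, accepting the
# longest prefix on which the two models agree.

def speculative_decode(target_step, draft_step, prompt, k=4, n_new=16):
    """target_step/draft_step: hypothetical greedy next-token functions
    mapping a token list to the next token id."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft k tokens with the cheap elastic sub-model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_step(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the full model recomputes each drafted position and
        #    accepts tokens until the first disagreement (greedy verification).
        accepted, ctx = [], list(tokens)
        for t in draft:
            t_target = target_step(ctx)
            if t_target != t:
                accepted.append(t_target)  # keep the target's correction
                break
            accepted.append(t)
            ctx.append(t)
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]


# Toy usage: both models count mod 10, so every draft token is accepted
# and each iteration advances k+0 tokens at one target pass per position.
step = lambda ctx: (ctx[-1] + 1) % 10
out = speculative_decode(step, step, [0], k=4, n_new=5)
```

When draft and target agree often (the paper's "strongly aligned self-draft" case), most iterations accept the full k-token draft, amortizing target-model expert loads across several tokens.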

Abstract

Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE, especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE (expert and bit) and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average 6.6× speedup and 4.4× energy efficiency gain over naive MoE serving on xPU across batch sizes 1–16, and delivers 2.2× speedup and 1.4× energy efficiency gain over the best-performing prior accelerator baseline.
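The "bit-nested" idea rests on a basic property of bit-slice arithmetic: a dot product computed from only the top bit planes of each weight is an exact partial sum of the full-precision result, so work done at low bit-width during drafting is not discarded at verification. A minimal sketch, assuming unsigned 8-bit weights; `bit_slices`, `sliced_dot`, and the slice layout are illustrative, not the paper's actual hardware format:

```python
# Hedged sketch: bit-nested execution via bit-slice decomposition.
# An 8-bit unsigned weight w = sum_i b_i * 2^i is split into bit planes.
# A dot product over only the MSB planes is a partial sum of the full one.

def bit_slices(w, bits=8):
    """Return the bit planes of integer w, MSB first."""
    return [(w >> i) & 1 for i in range(bits - 1, -1, -1)]

def sliced_dot(ws, xs, use_bits, total_bits=8):
    """Dot product using only the top `use_bits` bit planes of each weight."""
    acc = 0
    for w, x in zip(ws, xs):
        planes = bit_slices(w, total_bits)
        for i in range(use_bits):
            significance = 1 << (total_bits - 1 - i)  # weight of this plane
            acc += planes[i] * significance * x
    return acc

ws = [200, 37, 129]
xs = [3, 5, 7]

draft = sliced_dot(ws, xs, use_bits=4)  # low-bit "draft" pass (MSB half)
full  = sliced_dot(ws, xs, use_bits=8)  # full-precision pass
# The full result is the draft plus the remaining LSB-plane contributions,
# and matches the plain dot product exactly.
assert full == sum(w * x for w, x in zip(ws, xs))
```

This nesting is what lets a single bit-sliced datapath serve both the low-bit draft model and full-bit verification: the LSB planes only add the residual the draft pass left out.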