ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

arXiv cs.LG / 4/17/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • ELMoE-3D targets a core bottleneck in on-premises MoE serving: batching turns sparse per-token MoE computation into dense expert activation, making decoding memory-bound.
  • The work combines cache-based acceleration with speculative decoding in a hybrid HW-SW co-designed framework built on high-bandwidth hybrid-bonding hardware.
  • It introduces Elastic Self-Speculative Decoding (Elastic-SD), which jointly scales two intrinsic elasticity axes of MoE (expert and bit) so the reduced model acts both as an expert cache and as a strongly aligned self-draft model for verification.
  • A custom LSB-augmented bit-sliced architecture enables bit-nested execution by exploiting redundancy in bit-slice representations.
  • On its 3D-stacked hardware, ELMoE-3D shows average 6.6× speedups and 4.4× energy-efficiency improvements over naive xPU MoE serving (batch sizes 1–16), and 2.2× speedups with 1.4× energy-efficiency gains over the best prior accelerator baseline.
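The draft-then-verify loop behind the bullets above can be sketched generically. The function names (`target_step`, `draft_step`), the greedy acceptance rule, and the parameters are illustrative assumptions for a self-speculative setup where a reduced sub-model drafts for the full model; this is not the paper's exact Elastic-SD algorithm:

```python
# Hedged sketch of self-speculative decoding (illustrative, not the paper's
# exact method): a reduced sub-model (e.g. fewer experts / lower bit-width)
# drafts k tokens cheaply, and the full model verifies them, accepting the
# longest prefix on which the two models agree.

def speculative_decode(target_step, draft_step, prompt, k=4, n_new=16):
    """target_step/draft_step: hypothetical greedy next-token functions
    mapping a token list to the next token id."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft k tokens with the cheap elastic sub-model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_step(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the full model recomputes each drafted position and
        #    accepts tokens until the first disagreement (greedy verification).
        accepted, ctx = [], list(tokens)
        for t in draft:
            t_target = target_step(ctx)
            if t_target != t:
                accepted.append(t_target)  # keep the target's correction
                break
            accepted.append(t)
            ctx.append(t)
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]


# Toy usage: both models count mod 10, so every draft token is accepted
# and each iteration advances k+0 tokens at one target pass per position.
step = lambda ctx: (ctx[-1] + 1) % 10
out = speculative_decode(step, step, [0], k=4, n_new=5)
```

When draft and target agree often (the paper's "strongly aligned self-draft" case), most iterations accept the full k-token draft, amortizing target-model expert loads across several tokens.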

Abstract

Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE, especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE (expert and bit) and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average 6.6× speedup and 4.4× energy efficiency gain over naive MoE serving on xPU across batch sizes 1–16, and delivers 2.2× speedup and 1.4× energy efficiency gain over the best-performing prior accelerator baseline.
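The "bit-nested" idea rests on a basic property of bit-slice arithmetic: a dot product computed from only the top bit planes of each weight is an exact partial sum of the full-precision result, so work done at low bit-width during drafting is not discarded at verification. A minimal sketch, assuming unsigned 8-bit weights; `bit_slices`, `sliced_dot`, and the slice layout are illustrative, not the paper's actual hardware format:

```python
# Hedged sketch: bit-nested execution via bit-slice decomposition.
# An 8-bit unsigned weight w = sum_i b_i * 2^i is split into bit planes.
# A dot product over only the MSB planes is a partial sum of the full one.

def bit_slices(w, bits=8):
    """Return the bit planes of integer w, MSB first."""
    return [(w >> i) & 1 for i in range(bits - 1, -1, -1)]

def sliced_dot(ws, xs, use_bits, total_bits=8):
    """Dot product using only the top `use_bits` bit planes of each weight."""
    acc = 0
    for w, x in zip(ws, xs):
        planes = bit_slices(w, total_bits)
        for i in range(use_bits):
            significance = 1 << (total_bits - 1 - i)  # weight of this plane
            acc += planes[i] * significance * x
    return acc

ws = [200, 37, 129]
xs = [3, 5, 7]

draft = sliced_dot(ws, xs, use_bits=4)  # low-bit "draft" pass (MSB half)
full  = sliced_dot(ws, xs, use_bits=8)  # full-precision pass
# The full result is the draft plus the remaining LSB-plane contributions,
# and matches the plain dot product exactly.
assert full == sum(w * x for w, x in zip(ws, xs))
```

This nesting is what lets a single bit-sliced datapath serve both the low-bit draft model and full-bit verification: the LSB planes only add the residual the draft pass left out.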