Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

arXiv cs.CL / 4/10/2026


Key Points

  • The paper addresses a key inference bottleneck in Mixture-of-Experts (MoE) models: the large number of expert activations can significantly increase latency, particularly on resource-constrained deployments.
  • It introduces an “activation budget” framework that limits how many experts can be activated, aiming to prevent the performance degradation seen in prior methods that simply reduce activations.
  • The proposed Alloc-MoE optimizes expert activation allocation at two levels: Alloc-L uses sensitivity profiling with dynamic programming to choose layer-wise allocations, while Alloc-T redistributes activations at the token level using routing scores.
  • Experiments across multiple MoE models show that Alloc-MoE can maintain model performance under constrained activation budgets.
  • On DeepSeek-V2-Lite, Alloc-MoE reports speedups of 1.15× for prefill and 1.34× for decode while using only half of the original activation budget.
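The layer-level idea behind Alloc-L (profiling each layer's sensitivity to its activation count, then solving for the allocation with dynamic programming) can be sketched as a small knapsack-style DP. The sketch below is illustrative, not the paper's implementation: `sensitivity[l][k-1]` is an assumed profiled degradation score for giving layer `l` exactly `k` expert activations, and the DP minimizes total degradation subject to the global budget.

```python
def allocate_layer_budget(sensitivity, budget):
    """Hypothetical sketch of layer-wise budget allocation via dynamic programming.

    sensitivity[l][k-1]: assumed profiled degradation when layer l activates
    exactly k experts (k = 1..K). Returns per-layer activation counts whose
    sum is <= budget, minimizing total degradation. Each layer gets >= 1 expert.
    """
    L = len(sensitivity)
    K = len(sensitivity[0])
    assert budget >= L, "budget must allow at least one expert per layer"
    INF = float("inf")
    # dp[b] = min total degradation over layers processed so far
    # using exactly b activations in total
    dp = [INF] * (budget + 1)
    dp[0] = 0.0
    choice = [[0] * (budget + 1) for _ in range(L)]
    for l in range(L):
        new = [INF] * (budget + 1)
        for b in range(budget + 1):
            if dp[b] == INF:
                continue
            for k in range(1, min(K, budget - b) + 1):
                cost = dp[b] + sensitivity[l][k - 1]
                if cost < new[b + k]:
                    new[b + k] = cost
                    choice[l][b + k] = k  # remember k chosen for layer l at total b+k
        dp = new
    # best reachable total budget, then backtrack the per-layer choices
    best_b = min(range(budget + 1), key=lambda b: dp[b])
    alloc, b = [], best_b
    for l in range(L - 1, -1, -1):
        k = choice[l][b]
        alloc.append(k)
        b -= k
    alloc.reverse()
    return alloc
```

For example, with two layers where the first layer degrades sharply at one expert (0.5 vs. 0.1 at two) and the second is relatively insensitive, a budget of three activations would be split as two experts for the first layer and one for the second.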

Abstract

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations can lead to severe model performance degradation. In this work, we introduce the concept of an *activation budget* as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that allocates the budget in a coordinated fashion at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. Notably, Alloc-MoE achieves 1.15× prefill and 1.34× decode speedups on DeepSeek-V2-Lite at half of the original budget.
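One plausible reading of Alloc-T's token-level redistribution is that, instead of a fixed top-k of experts per token, the pooled budget is spent on the globally highest routing scores, so "confident" tokens give up activations that "uncertain" tokens can use. The sketch below illustrates that idea under this assumption; the selection rule is hypothetical and not necessarily the paper's exact mechanism.

```python
import numpy as np

def redistribute_token_budget(router_scores, avg_k):
    """Hypothetical sketch of token-level budget redistribution.

    router_scores: (T, E) array of softmax routing scores for T tokens over
    E experts. Rather than activating a fixed avg_k experts per token, spend
    the pooled budget T * avg_k on the globally highest-scoring (token, expert)
    pairs, so tokens with flat (uncertain) routing distributions can receive
    more experts than confidently routed tokens.
    Returns a (T, E) boolean mask of activated experts.
    """
    T, E = router_scores.shape
    total = T * avg_k  # pooled activation budget
    flat = router_scores.ravel()
    # indices of the `total` largest scores across all (token, expert) pairs
    top = np.argpartition(flat, -total)[-total:]
    mask = np.zeros(T * E, dtype=bool)
    mask[top] = True
    return mask.reshape(T, E)
```

With an average of one expert per token, a token whose top two routing scores are nearly tied can claim both activations while a near-uniform token claims none, keeping the total activation count, and hence the budget, unchanged.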