Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

arXiv cs.CL / 4/10/2026


Key Points

  • The paper addresses a key inference bottleneck in Mixture-of-Experts (MoE) models: the large number of expert activations can significantly increase latency, particularly on resource-constrained deployments.
  • It introduces an “activation budget” framework that limits how many experts can be activated, aiming to prevent the performance degradation seen in prior methods that simply reduce activations.
  • The proposed Alloc-MoE optimizes expert activation allocation at two levels: Alloc-L uses sensitivity profiling with dynamic programming to choose layer-wise allocations, while Alloc-T redistributes activations at the token level using routing scores.
  • Experiments across multiple MoE models show that Alloc-MoE can maintain model performance under constrained activation budgets.
  • On DeepSeek-V2-Lite, Alloc-MoE reports speedups of 1.15× for prefill and 1.34× for decode while using only half of the original activation budget.
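The layer-level idea behind Alloc-L (profiling each layer's sensitivity to its activation count, then solving for the allocation with dynamic programming) can be sketched as a small knapsack-style DP. The sketch below is illustrative, not the paper's implementation: `sensitivity[l][k-1]` is an assumed profiled degradation score for giving layer `l` exactly `k` expert activations, and the DP minimizes total degradation subject to the global budget.

```python
def allocate_layer_budget(sensitivity, budget):
    """Hypothetical sketch of layer-wise budget allocation via dynamic programming.

    sensitivity[l][k-1]: assumed profiled degradation when layer l activates
    exactly k experts (k = 1..K). Returns per-layer activation counts whose
    sum is <= budget, minimizing total degradation. Each layer gets >= 1 expert.
    """
    L = len(sensitivity)
    K = len(sensitivity[0])
    assert budget >= L, "budget must allow at least one expert per layer"
    INF = float("inf")
    # dp[b] = min total degradation over layers processed so far
    # using exactly b activations in total
    dp = [INF] * (budget + 1)
    dp[0] = 0.0
    choice = [[0] * (budget + 1) for _ in range(L)]
    for l in range(L):
        new = [INF] * (budget + 1)
        for b in range(budget + 1):
            if dp[b] == INF:
                continue
            for k in range(1, min(K, budget - b) + 1):
                cost = dp[b] + sensitivity[l][k - 1]
                if cost < new[b + k]:
                    new[b + k] = cost
                    choice[l][b + k] = k  # remember k chosen for layer l at total b+k
        dp = new
    # best reachable total budget, then backtrack the per-layer choices
    best_b = min(range(budget + 1), key=lambda b: dp[b])
    alloc, b = [], best_b
    for l in range(L - 1, -1, -1):
        k = choice[l][b]
        alloc.append(k)
        b -= k
    alloc.reverse()
    return alloc
```

For example, with two layers where the first layer degrades sharply at one expert (0.5 vs. 0.1 at two) and the second is relatively insensitive, a budget of three activations would be split as two experts for the first layer and one for the second.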

Abstract

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations can lead to severe model performance degradation. In this work, we introduce the concept of an *activation budget* as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that allocates the budget in a coordinated fashion at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. Notably, Alloc-MoE achieves 1.15× prefill and 1.34× decode speedups on DeepSeek-V2-Lite at half of the original budget.
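One plausible reading of Alloc-T's token-level redistribution is that, instead of a fixed top-k of experts per token, the pooled budget is spent on the globally highest routing scores, so "confident" tokens give up activations that "uncertain" tokens can use. The sketch below illustrates that idea under this assumption; the selection rule is hypothetical and not necessarily the paper's exact mechanism.

```python
import numpy as np

def redistribute_token_budget(router_scores, avg_k):
    """Hypothetical sketch of token-level budget redistribution.

    router_scores: (T, E) array of softmax routing scores for T tokens over
    E experts. Rather than activating a fixed avg_k experts per token, spend
    the pooled budget T * avg_k on the globally highest-scoring (token, expert)
    pairs, so tokens with flat (uncertain) routing distributions can receive
    more experts than confidently routed tokens.
    Returns a (T, E) boolean mask of activated experts.
    """
    T, E = router_scores.shape
    total = T * avg_k  # pooled activation budget
    flat = router_scores.ravel()
    # indices of the `total` largest scores across all (token, expert) pairs
    top = np.argpartition(flat, -total)[-total:]
    mask = np.zeros(T * E, dtype=bool)
    mask[top] = True
    return mask.reshape(T, E)
```

With an average of one expert per token, a token whose top two routing scores are nearly tied can claim both activations while a near-uniform token claims none, keeping the total activation count, and hence the budget, unchanged.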