MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

arXiv cs.LG / 4/9/2026


Key Points

  • The paper introduces MoBiE, a post-training binarization framework specifically designed to make Mixture-of-Experts (MoE) LLM inference more efficient under quantization, addressing MoE-specific problems that prior binary methods for dense models miss.
  • MoBiE combines three techniques: joint SVD decomposition to reduce cross-expert redundancy, global-loss-gradient-enhanced Hessian metrics for better weight-importance estimation, and an input-null-space-guided error constraint to limit the routing distortion caused by quantization (hedged sketches of each appear below).
  • The method targets extreme compression with no additional storage overhead, aiming to preserve model quality while cutting memory use and speeding up inference.
  • Experiments show sizable gains on multiple MoE-based LLMs, including Qwen3-30B-A3B where MoBiE reportedly cuts perplexity by 52.2%, boosts average zero-shot performance by 43.4%, and delivers over 2× inference speedup alongside faster quantization.
  • The authors provide an open-source implementation, enabling direct evaluation and adoption by researchers and practitioners working on quantized MoE inference.
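
To make the joint-SVD idea concrete, here is a minimal Python sketch, not the authors' implementation: the function name, the stacking scheme, and the fixed `rank` are illustrative assumptions.

```python
import numpy as np

def joint_expert_basis(expert_weights, rank):
    """Shared low-rank basis across all experts of one MoE layer.

    expert_weights: list of E arrays, each (d_out, d_in), e.g. the
    up-projection of every expert. One thin SVD of the stacked matrix
    exposes input directions the experts have in common.
    """
    stacked = np.concatenate(expert_weights, axis=0)   # (E * d_out, d_in)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    V = Vt[:rank].T                                    # (d_in, rank) shared basis
    coeffs = [W @ V for W in expert_weights]           # (d_out, rank) per expert
    # Reconstruction: W_e ≈ coeffs[e] @ V.T. The basis is stored once for
    # the whole layer; only the smaller per-expert parts remain to binarize.
    return V, coeffs
```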

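The second technique blends local second-order statistics with global loss gradients to score weight importance. Below is a hedged sketch assuming a simple linear mix; the OBS-style local term, the first-order global term, and `lam` are guesses at the paper's actual metric.

```python
import numpy as np

def gradient_enhanced_saliency(W, X, G, lam=0.5):
    """Per-weight importance mixing local curvature with a global loss signal.

    W:   (d_out, d_in) weights of one expert projection.
    X:   (n, d_in) calibration inputs to the layer (local statistic).
    G:   (d_out, d_in) gradient of the end-to-end loss w.r.t. W (global signal).
    lam: mixing coefficient; both the form and the value are assumptions.
    """
    h_diag = np.mean(X * X, axis=0)          # diagonal of the Hessian proxy (1/n) X^T X
    local_term = (W ** 2) * h_diag           # OBS-style cost of perturbing each weight
    global_term = np.abs(W * G)              # first-order sensitivity of the task loss
    return lam * local_term + (1.0 - lam) * global_term
```

High-scoring weights would be protected during binarization (e.g., given finer-grained scales), while low-scoring weights absorb more of the rounding error.
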
Abstract

Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: (1) using joint SVD decomposition to reduce cross-expert redundancy; (2) integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; (3) introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.
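
The third technique can be read as confining the quantization error to the (approximate) null space of the layer's calibration inputs, so the binarized weights produce nearly the same activations, and hence the same router decisions, as the originals. A minimal sketch under that reading; in a real binary model the compensation term would have to be absorbed into full-precision components such as scales or low-rank factors, and `energy` is a hypothetical threshold.

```python
import numpy as np

def input_subspace(X, energy=0.99):
    """Dominant input directions of calibration activations X (n, d_in).

    Directions orthogonal to the returned basis form the approximate input
    null space: error placed there barely changes the layer's outputs.
    """
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    cum = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(cum, energy)) + 1
    return Vt[:k].T                          # (d_in, k)

def null_space_constrained(W, W_bin, V):
    """Compensate the binarization error along the input subspace.

    The residual error W - W_q then lies in the input null space, so
    W_q @ x ≈ W @ x for typical inputs x, keeping router logits stable.
    """
    err = W - W_bin                          # raw quantization error
    return W_bin + err @ V @ V.T             # cancel its visible component
```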