Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs

arXiv stat.ML / 3/27/2026



Key Points

The paper presents a communication-theoretic perspective on Mixture-of-Experts (MoE) gating by treating the gate as a stochastic channel constrained by a finite information rate.
It specializes a mutual-information generalization bound and develops a rate–distortion characterization D(R_g) for finite-rate gating, where the gating rate is R_g = I(X;T).
Under an empirical rate–distortion optimality assumption, the authors bound the expected generalization error by E[R(W)] ≤ D(R_g) + δ_m + sqrt((2/m) I(S;W)), that is, the distortion term D(R_g) plus the additional complexity and sample-size terms.
The results provide capacity-aware limits for communication-constrained MoE systems, explicitly quantifying trade-offs among gating rate, model expressivity, and generalization performance.
Synthetic experiments with multi-expert models empirically validate the predicted relationships between gating rate and generalization.
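
To make the gating rate R_g = I(X;T) from the points above concrete, here is a minimal sketch that computes it for a toy gate. It rests on assumptions that are not taken from the paper: X is discrete with a known distribution, the gate is a softmax over hypothetical logits, and a temperature parameter is used only as a convenient knob for making the gate more or less stochastic. The function names `gating_rate` and `softmax_gate` are illustrative.

```python
import numpy as np

def gating_rate(p_x, gate):
    """Mutual information I(X;T) in nats.

    p_x  : shape (n,), distribution over n discrete inputs (assumed known).
    gate : shape (n, k), row i is the gate's distribution p(t | x_i) over k experts.
    """
    p_x = np.asarray(p_x, dtype=float)
    gate = np.asarray(gate, dtype=float)
    p_t = p_x @ gate                       # marginal p(t) over experts
    joint = p_x[:, None] * gate            # p(x, t)
    indep = p_x[:, None] * p_t[None, :]    # p(x) p(t)
    mask = joint > 0                       # convention: 0 * log 0 = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

def softmax_gate(logits, temperature):
    """Row-wise softmax of logits / temperature (hypothetical gate model)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy setup: 4 equiprobable inputs, 2 experts, made-up logits.
p_x = np.full(4, 0.25)
logits = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]])

for temperature in (0.1, 1.0, 10.0):
    rate = gating_rate(p_x, softmax_gate(logits, temperature))
    print(f"temperature={temperature:5.1f}  R_g={rate:.4f} nats")
```

In this toy setting the rate can never exceed log K nats for K experts, since I(X;T) ≤ H(T) ≤ log K; a nearly deterministic gate approaches that ceiling and a nearly uniform gate drives the rate toward zero.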

Abstract

Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g := I(X; T)$, yielding (under a standard empirical rate-distortion optimality condition) $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
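
As a rough illustration of how the three terms of the bound combine, the sketch below evaluates the right-hand side D(R_g) + δ_m + sqrt((2/m) I(S;W)) for given values. Everything numeric here is hypothetical: the abstract does not report values for D(R_g), δ_m, or I(S;W), and it does not state the units of the mutual information, so the code simply evaluates the formula as written. The function names are illustrative.

```python
import math

def sample_size_term(mi_sw, m):
    """sqrt((2/m) * I(S;W)) for m training samples.

    mi_sw : value of I(S;W); units are whatever the bound assumes
            (not specified in the abstract).
    """
    if m <= 0:
        raise ValueError("m must be a positive sample count")
    if mi_sw < 0:
        raise ValueError("mutual information cannot be negative")
    return math.sqrt(2.0 * mi_sw / m)

def bound_rhs(distortion, delta_m, mi_sw, m):
    """Right-hand side of the bound: D(R_g) + delta_m + sqrt((2/m) I(S;W))."""
    return distortion + delta_m + sample_size_term(mi_sw, m)

# Hypothetical inputs, chosen only to show how the terms scale with m.
for m in (100, 400, 1600):
    rhs = bound_rhs(distortion=0.10, delta_m=0.05, mi_sw=2.0, m=m)
    print(f"m={m:5d}  bound={rhs:.3f}")
```

With the other quantities held fixed, quadrupling m halves the square-root term, so for large m the bound is dominated by D(R_g) + δ_m. In practice δ_m and I(S;W) may themselves depend on m, which this sketch does not model.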