Generalization and Scaling Laws for Mixture-of-Experts Transformers

arXiv cs.LG / 4/13/2026


Key Points

  • The paper develops a theory for generalization and neural scaling in Mixture-of-Experts (MoE) Transformers by separating active per-input capacity from routing combinatorics.
  • It derives a sup-norm covering-number bound using conditioning on fixed routing patterns and union bounds, yielding a generalization guarantee that accounts for MoE-specific routing overhead.
  • Using a manifold data model and smooth target assumptions ($C^\beta$), the work characterizes the approximation–estimation tradeoff similarly to dense networks once active parameters are properly included.
  • It proves constructive approximation results for MoE, showing that approximation error can be reduced either by increasing active capacity or by adding more experts, depending on which bottleneck dominates.
  • The authors translate the theory into neural scaling laws for model size, data size, and compute-optimal tradeoffs, clarifying which scaling behaviors are supported by worst-case statistical guarantees versus those requiring data-dependent routing or optimization effects.
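The compute-optimal tradeoff in the last point can be made concrete with a standard Chinchilla-style calculation. The sketch below is illustrative, not the paper's actual law: the loss form $L(N, D) = A N^{-\alpha} + B D^{-\beta}$, the constants, and the compute model $C = \kappa N D$ (with $N$ the *active* parameter count) are all hypothetical assumptions.

```python
# Illustrative compute-optimal tradeoff sketch (hypothetical constants,
# not the paper's fitted law). N = active parameters, D = training tokens.

def loss(N, D, A=400.0, alpha=0.34, B=410.0, beta=0.28):
    """Assumed power-law loss: approximation term in N, estimation term in D."""
    return A * N**-alpha + B * D**-beta

def compute_optimal_N(C, kappa=6.0, A=400.0, alpha=0.34, B=410.0, beta=0.28):
    """Minimize loss(N, D) subject to the compute budget kappa * N * D = C.

    Substituting D = C / (kappa * N) and setting dL/dN = 0 gives
        N* = (alpha*A / (beta*B))**(1/(alpha+beta)) * (C/kappa)**(beta/(alpha+beta)),
    i.e. optimal model size grows as a power of compute.
    """
    return (alpha * A / (beta * B)) ** (1.0 / (alpha + beta)) \
        * (C / kappa) ** (beta / (alpha + beta))
```

Any reallocation of the same budget away from `N*` (larger or smaller model, with tokens adjusted to keep `kappa * N * D = C`) yields a strictly higher loss under these assumptions.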

Abstract

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates *active* per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^\beta$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.
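The union-bound bookkeeping behind the covering-number result can be sketched numerically. The decomposition below is a hedged illustration of the general idea, not the paper's exact bound: the dense-style entropy term and the top-$k$ pattern count are standard forms assumed here for concreteness, and all parameter values are hypothetical.

```python
# Hedged sketch of the routing union bound: metric entropy splits into a
# dense-style term driven by the *active* parameter budget plus a routing
# overhead from union-bounding over all fixed routing patterns.
# (Illustrative form only; the paper's exact constants and exponents differ.)
from math import comb, log

def log_covering_number(eps, p_active, n_experts, top_k, n_moe_layers, seq_len):
    """Approximate log N(eps) for an MoE function class.

    dense_term:   entropy of one network with p_active active parameters,
                  covered at scale eps (assumed ~ p_active * log(1/eps)).
    routing_term: log of the number of fixed routing patterns, with each
                  token in each MoE layer choosing top_k of n_experts.
    """
    dense_term = p_active * log(1.0 / eps)
    patterns_per_token = comb(n_experts, top_k)   # top-k expert subsets
    routing_term = n_moe_layers * seq_len * log(patterns_per_token)
    return dense_term + routing_term
```

Note that the routing overhead grows only logarithmically in the number of expert subsets per token, which is why, in this style of analysis, estimation error is governed chiefly by the active parameter count rather than total parameters.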