Why MoE models keep converging on ~10B active parameters

Reddit r/LocalLLaMA / 4/7/2026


Key Points

  • The article argues that many Mixture-of-Experts (MoE) models converge to roughly 10B active parameters despite having very different total model sizes and expert counts.
  • It provides a training-cost heuristic of C ≈ 6 × N_active × T, claiming that economics make ~10B active a practical sweet spot for scaling with token budgets.
  • It compares compute for an MoE setup (e.g., 10B active with 15T tokens) versus a dense 70B model, suggesting MoE can achieve similar outcomes at a fraction of the compute.
  • It raises an open question about inference-time memory scaling when the number of experts increases but the number of active parameters remains fixed.
  • It suggests that KV cache likely dominates inference memory beyond around 32k context length, potentially limiting the benefits of adding more experts without increasing active parameters.

Interesting pattern: despite wildly different total sizes, many recent MoE models land around 10B active params. Qwen 3.5 122B activates 10B. MiniMax M2.7 runs 230B total with 10B active via Top-2 routing.

Training cost scales as C ≈ 6 × N_active × T. At 10B active and 15T tokens, you get ~9e23 FLOPs, roughly 1/7th the compute of a dense 70B trained on the same data. The economics practically force this convergence.
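As a sanity check, the arithmetic above can be sketched directly (the 6·N·T constant is the standard training-FLOPs approximation; model sizes are the ones quoted in the post):

```python
# Training-compute heuristic: C ≈ 6 * N_active * T (FLOPs)
def train_flops(n_active_params: float, tokens: float) -> float:
    return 6.0 * n_active_params * tokens

TOKENS = 15e12  # 15T tokens

moe = train_flops(10e9, TOKENS)    # 10B active MoE
dense = train_flops(70e9, TOKENS)  # dense 70B baseline

print(f"MoE:   {moe:.1e} FLOPs")     # 9.0e+23
print(f"dense: {dense:.1e} FLOPs")   # 6.3e+24
print(f"ratio: {dense / moe:.1f}x")  # 7.0x
```

Only the active parameter count enters the FLOPs estimate, which is why a 230B-total model with 10B active trains at the same compute budget as a dense 10B.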

Has anyone measured real inference memory scaling when expert count increases but active params stay fixed? KV cache seems to dominate past 32k context regardless.
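On the open question, a back-of-envelope memory model helps frame it. All architecture numbers below (layer count, GQA KV heads, head dim) are illustrative assumptions, not taken from any specific model: expert weights are a fixed cost that scales with total (not active) parameter count, while KV cache grows linearly with context length and batch size.

```python
# Rough inference-memory sketch for an MoE model (illustrative numbers).
# fp16/bf16 storage assumed: 2 bytes per parameter / cache element.
BYTES = 2

def weight_bytes(total_params: float) -> float:
    # All experts must be resident (or pageable), so this scales with
    # TOTAL parameters even when only ~10B are active per token.
    return total_params * BYTES

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx_len: int, batch: int = 1) -> int:
    # 2x for K and V; grows linearly with context length and batch size.
    return batch * layers * 2 * kv_heads * head_dim * ctx_len * BYTES

# Hypothetical 230B-total MoE: 60 layers, GQA with 8 KV heads, d_head=128.
weights = weight_bytes(230e9)
for ctx in (8_192, 32_768, 131_072):
    kv = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, ctx_len=ctx)
    print(f"ctx={ctx:>7}: KV per request ≈ {kv / 1e9:5.1f} GB "
          f"(weights fixed at {weights / 1e9:.0f} GB)")
```

Per request the weights dwarf the cache, but weights are shared across the batch while each request carries its own KV cache, so at high batch and long context the caches add up. That is where the "KV dominates past 32k" intuition comes from, and it suggests adding experts mostly costs weight memory, not per-token serving memory.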

submitted by /u/Spare_Pair_9198
