Expert Upcycling: Growing MoE capacity mid-training without increasing inference cost (7B→13B, ~32% GPU hours saved)

Reddit r/LocalLLaMA / 4/24/2026


Key Points

  • The preprint introduces “expert upcycling,” a method to increase Mixture-of-Experts (MoE) capacity mid-training by duplicating existing experts and adding small noise to router replicas.
  • By keeping Top-K routing fixed, the approach preserves per-token FLOPs and inference cost even as the total expert count grows (e.g., 7B→13B via 32→64 experts).
  • The method relies on loss-free load balancing to ensure every expert replica receives gradient signal, preventing routing collapse and enabling specialization.
  • Experiments on a Llama-4-like interleaved MoE show nearly matching validation loss and downstream accuracy versus training a larger fixed-expert model from scratch, while reducing GPU hours by ~32% (and ~67% when an intermediate checkpoint already exists).
  • The authors report generalization to larger full-MoE settings (e.g., 256 experts with Top-8), and provide links to the paper and open-source code for further evaluation.

Author here, sharing a preprint we recently released. We're actively looking for feedback from this community before we revise.

Motivation. Training large MoEs from scratch is expensive. All expert weights, gradients, and optimizer states must reside in accelerator memory regardless of how few are active per token, and all-to-all communication can consume 45–50% of step time on standard GPU clusters. Both costs scale with total expert count, which is in tension with scaling laws that recommend lower activation ratios (more experts at fixed active parameters) for better quality-per-FLOP.

Method. We introduce expert upcycling: given a trained E-expert MoE, we expand to mE experts mid-training by duplicating existing experts and extending the router with small bias noise on replicas. Top-K routing is held fixed, so per-token FLOPs and inference cost are unchanged. Continued pre-training then breaks the symmetry among duplicated experts, driving specialization. The key enabler is loss-free load balancing, which guarantees every replica receives gradient signal and prevents routing collapse.
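The expansion step can be sketched in a few lines. This is a minimal NumPy toy under stated assumptions (tiny sizes, a linear router `logits = x @ W_r + b_r`, and an illustrative noise scale), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, E, top_k = 16, 4, 2   # toy sizes; the paper's 7B model uses 32 experts, Top-2
m = 2                          # duplication factor: E -> m*E experts

# A "trained" router and expert stack (random stand-ins here).
W_r = rng.normal(size=(d_model, E))
b_r = np.zeros(E)
experts = [rng.normal(size=(d_model, d_model)) for _ in range(E)]

# --- Expert upcycling: duplicate experts and router columns, add small bias noise ---
W_r_up = np.concatenate([W_r, W_r], axis=1)                       # (d_model, m*E)
b_r_up = np.concatenate([b_r, b_r + 1e-3 * rng.normal(size=E)])   # noise breaks ties
experts_up = experts + [w.copy() for w in experts]                # replicas start identical

# Top-K is held fixed, so each token still activates exactly top_k experts:
x = rng.normal(size=(d_model,))
logits = x @ W_r_up + b_r_up
chosen = np.argsort(logits)[-top_k:]
assert len(chosen) == top_k   # per-token FLOPs unchanged after upcycling
```

Continued training then differentiates the replicas: once gradients flow through both copies on (slightly) different token sets, the initial symmetry is broken and they specialize.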

Results. On a 7B→13B interleaved MoE (32→64 experts, Top-2, architecture similar to Llama 4):

  • Validation loss: 1.263 (upcycled) vs. 1.267 (fixed-64 from scratch)
  • Average accuracy across 11 downstream benchmarks: 56.4 vs. 56.7
  • GPU hours: ~32% reduction vs. training the 64-expert model from scratch
  • ~67% reduction in the sunk-cost setting where the 32-expert checkpoint already exists
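To see how the two savings figures relate, here is a back-of-envelope cost model. The numbers below are hypothetical, chosen only to illustrate the decomposition (the actual split of the token budget and the per-step cost ratio are not stated in this post):

```python
def gpu_hour_savings(f_small: float, step_cost_ratio: float) -> float:
    """Fraction of GPU hours saved vs. training the large model for the whole budget.

    f_small: fraction of the token budget trained before upcycling (small-expert phase)
    step_cost_ratio: per-token wall-clock cost of the small phase relative to the large one
    (hypothetical cost model; real numbers depend on cluster, parallelism, and schedule)
    """
    blended = f_small * step_cost_ratio + (1.0 - f_small)
    return 1.0 - blended

def sunk_cost_savings(f_small: float) -> float:
    # If the small checkpoint already exists, its phase costs nothing extra.
    return f_small

# Illustrative values only: ~2/3 of tokens in the small phase at ~half the step cost.
print(round(gpu_hour_savings(2 / 3, 0.52), 2))  # ~0.32
print(round(sunk_cost_savings(2 / 3), 2))       # ~0.67
```

The point of the toy model: the sunk-cost saving is bounded by the small-phase fraction of the budget, while the from-scratch saving is that fraction discounted by how much cheaper the small phase actually runs.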

We also validate on a full MoE with 256 experts and Top-8 routing (matching DeepSeek-V3, Kimi K2, and GLM-4.5 configurations), showing the approach generalizes beyond the interleaved architecture.
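The loss-free load balancing mentioned above follows the auxiliary-loss-free idea popularized by DeepSeek-V3: a per-expert bias enters only the Top-K selection (not the gating weights), and after each step it is nudged down for overloaded experts and up for underloaded ones. A minimal sketch with frozen router scores and an illustrative update rate (toy sizes, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, E, top_k, u = 4096, 8, 2, 0.02   # u is the bias update rate (illustrative)

# Skewed router scores: later experts dominate, so unbalanced routing at first.
scores = rng.normal(size=(n_tokens, E)) + np.linspace(0.0, 3.0, E)
bias = np.zeros(E)                          # balancing bias: affects selection only

for _ in range(500):
    # Top-K selection uses biased scores; gate weights would use the raw scores.
    topk = np.argsort(scores + bias, axis=1)[:, -top_k:]
    load = np.bincount(topk.ravel(), minlength=E)
    # Nudge bias against the imbalance: overloaded experts down, underloaded up.
    bias -= u * np.sign(load - load.mean())

topk = np.argsort(scores + bias, axis=1)[:, -top_k:]
load = np.bincount(topk.ravel(), minlength=E)
print(load)   # loads end up near the balanced value n_tokens * top_k / E
```

Because every expert (and every replica, after duplication) keeps receiving tokens under this scheme, each one keeps receiving gradient signal, which is what prevents the routing collapse that would otherwise leave replicas permanently identical or dead.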

Paper: https://huggingface.co/papers/2604.19835

Code and training configurations: github.com/amazon-science/expert-upcycling

Happy to discuss the method, ablations (including a practical recipe for transition timing and duplication strategy), the theoretical framing, or training setup in detail — and genuinely interested in pushback on limitations and failure modes we may not have stress-tested.


submitted by /u/Pigs-On-Wing