Author here, sharing a preprint we recently released. We're actively looking for feedback from this community before we revise.

**Motivation.** Training large MoEs from scratch is expensive: all expert weights, gradients, and optimizer states must reside in accelerator memory regardless of how few experts are active per token, and all-to-all communication can consume 45–50% of step time on standard GPU clusters. Both costs scale with total expert count, which is in tension with scaling laws that recommend lower activation ratios (more experts at fixed active parameters) for better quality-per-FLOP.

**Method.** We introduce expert upcycling: given a trained E-expert MoE, we expand to mE experts mid-training by duplicating existing experts and extending the router with small bias noise on the replicas. Top-K routing is held fixed, so per-token FLOPs and inference cost are unchanged. Continued pre-training then breaks the symmetry among the duplicated experts, driving specialization. The key enabler is loss-free load balancing, which guarantees every replica receives gradient signal and prevents routing collapse.

**Results.** On a 7B→13B interleaved MoE (32→64 experts, Top-2, architecture similar to Llama 4), we nearly match the validation loss and downstream accuracy of training the larger fixed-expert model from scratch, while reducing GPU hours by ~32% (~67% when an intermediate checkpoint already exists).
We also validate on a full MoE with 256 experts and Top-8 routing (matching DeepSeek-V3, Kimi K2, and GLM-4.5 configurations), showing the approach generalizes beyond the interleaved architecture.

Paper: https://huggingface.co/papers/2604.19835

Code and training configurations: github.com/amazon-science/expert-upcycling

Happy to discuss the method, ablations (including a practical recipe for transition timing and duplication strategy), the theoretical framing, or training setup in detail, and we're genuinely interested in pushback on limitations and failure modes we may not have stress-tested.
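To make the expansion step concrete, here is a minimal plain-Python sketch of the duplicate-and-noise operation described above. The function name, the list-based stand-ins for real tensors, and the noise scale are all illustrative assumptions, not the paper's implementation:

```python
import copy
import random

def upcycle_experts(expert_weights, router_rows, multiplier=2, noise_std=1e-3):
    """Expand an E-expert layer to multiplier * E experts.

    Each expert's parameters are copied verbatim. The first copy's router row
    is kept exact; replica rows get small Gaussian noise so that continued
    training can break the symmetry between duplicates. Top-K is unchanged,
    so per-token FLOPs stay the same.
    """
    new_experts, new_router = [], []
    for weights, row in zip(expert_weights, router_rows):
        for rep in range(multiplier):
            new_experts.append(copy.deepcopy(weights))
            if rep == 0:
                new_router.append(list(row))  # original router row, untouched
            else:
                # replica: perturb the router logit-projection row slightly
                new_router.append([x + random.gauss(0.0, noise_std) for x in row])
    return new_experts, new_router
```

With `multiplier=2` this doubles the expert count (e.g., 32→64) while every replica starts as an exact functional copy of its parent; only the router tie-break differs at step zero.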
Expert Upcycling: Growing MoE capacity mid-training without increasing inference cost (7B→13B, ~32% GPU hours saved)
Reddit r/LocalLLaMA / 4/24/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The preprint introduces “expert upcycling,” a method to increase Mixture-of-Experts (MoE) capacity mid-training by duplicating existing experts and adding small noise to router replicas.
- By keeping Top-K routing fixed, the approach preserves per-token FLOPs and inference cost even as the total expert count grows (e.g., 7B→13B via 32→64 experts).
- The method relies on loss-free load balancing to ensure every expert replica receives gradient signal, preventing routing collapse and enabling specialization.
- Experiments on a Llama-4-like interleaved MoE show nearly matching validation loss and downstream accuracy versus training a larger fixed-expert model from scratch, while reducing GPU hours by ~32% (and ~67% when an intermediate checkpoint already exists).
- The authors report generalization to larger full-MoE settings (e.g., 256 experts with Top-8), and provide links to the paper and open-source code for further evaluation.
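The loss-free load balancing mentioned in the key points can be sketched as follows. This is our reading of the general auxiliary-loss-free idea, not the paper's code: a per-expert bias influences only which experts are selected, while gate weights still come from the raw scores, and the bias is nudged after each batch toward uniform load. Function names, the update rule, and `gamma` are assumptions:

```python
def biased_topk(scores, bias, k):
    """Pick Top-K experts by score + bias. The bias steers selection only;
    the returned gate values use the raw scores, so no balancing term
    enters the loss."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i],
                   reverse=True)
    chosen = order[:k]
    return chosen, [scores[i] for i in chosen]

def update_bias(bias, tokens_per_expert, gamma=1e-3):
    """After a batch, raise the bias of under-loaded experts and lower it
    for over-loaded ones, so every replica keeps receiving tokens (and
    hence gradient signal) without an auxiliary loss."""
    target = sum(tokens_per_expert) / len(tokens_per_expert)
    return [b + gamma * (1 if n < target else -1 if n > target else 0)
            for b, n in zip(bias, tokens_per_expert)]
```

Under this scheme a freshly duplicated replica that is starved of tokens accumulates positive bias until it re-enters the Top-K, which is what prevents the routing collapse the post describes.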