Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
arXiv cs.LG / 4/28/2026
📰 News · Models & Research
Key Points
- Mixture-of-Experts (MoE) models perform well on benchmarks, but supervised fine-tuning (SFT) is challenging because MoE router layers are fragile and prone to collapse.
- Existing approaches such as DenseMixer and ESFT prevent router collapse via dense mixing or auxiliary load-balancing losses (a standard such loss is sketched after this list), but these mechanisms can introduce noisy gradients that hurt downstream performance.
- Preliminary pruning experiments show that even rarely activated (long-tailed) experts contain useful, non-trivial knowledge, since removing them causes noticeable performance drops.
- The paper proposes an auxiliary-loss-free MoE SFT method that uses bias-driven sparsification plus always-active gated “condenser” experts to preserve long-tailed expert information without gradient starvation (see the illustrative sketch below this list).
- Large-scale experiments indicate the proposed approach outperforms DenseMixer and ESFT, with an average improvement of more than 2.5% on mathematical reasoning and CommonsenseQA benchmarks.
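
For context, the auxiliary load-balancing loss that the paper's method avoids is, in the standard Switch-Transformer formulation, roughly the following. This is a generic sketch for background, not code from the paper; the function name and shapes are illustrative.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Standard Switch-Transformer-style auxiliary load-balancing loss.

    router_logits: (num_tokens, num_experts) raw router scores.
    The loss pushes expert usage toward uniform; this is the kind of
    auxiliary term that can inject gradients unrelated to the task loss.
    """
    num_tokens, num_experts = router_logits.shape
    probs = torch.softmax(router_logits, dim=-1)          # (tokens, experts)
    # Fraction of tokens hard-routed to each expert under top-k selection.
    top_k_idx = probs.topk(top_k, dim=-1).indices
    mask = torch.zeros_like(probs).scatter_(-1, top_k_idx, 1.0)
    fraction_routed = mask.mean(dim=0)                    # f_i
    mean_prob = probs.mean(dim=0)                         # P_i
    # Minimized when both distributions are uniform across experts.
    return num_experts * torch.sum(fraction_routed * mean_prob)
```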
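The summary does not give implementation details, so the following is a speculative PyTorch sketch of how bias-driven sparsification and always-active gated condenser experts could fit together in one MoE layer. Every name here (AuxFreeMoELayer, routing_bias, condenser_gate, num_condensers) is a hypothetical stand-in, not the authors' code.

```python
import torch
import torch.nn as nn

class AuxFreeMoELayer(nn.Module):
    """Hypothetical sketch (not the paper's implementation): a learned
    per-expert bias steers top-k expert selection without any auxiliary
    loss, while a few always-active gated "condenser" experts see every
    token so rarely routed (long-tailed) paths still receive gradient.
    """

    def __init__(self, dim: int, num_experts: int, num_condensers: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        # Bias used only to steer top-k selection (no load-balancing loss).
        self.routing_bias = nn.Parameter(torch.zeros(num_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # Always-active condenser experts with per-token scalar gates.
        self.condensers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_condensers)
        )
        self.condenser_gate = nn.Linear(dim, num_condensers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.router(x)                               # (tokens, E)
        # The bias only affects which experts are selected...
        topk_idx = (logits + self.routing_bias).topk(self.top_k, dim=-1).indices
        # ...while combination weights come from the unbiased scores.
        weights = torch.softmax(logits.gather(-1, topk_idx), dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                sel = topk_idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot, None] * self.experts[e](x[sel])
        # Condensers process every token, so no path is gradient-starved.
        gates = torch.sigmoid(self.condenser_gate(x))         # (tokens, C)
        for c, condenser in enumerate(self.condensers):
            out = out + gates[:, c, None] * condenser(x)
        return out
```

The design point this sketch tries to capture: the routing bias drives sparsification without ever appearing in the loss, and the condensers guarantee every token contributes gradient to some shared capacity, which is how long-tailed expert knowledge could be preserved under SFT.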
Related Articles

Behind the Scenes of a Self-Evolving AI: The Architecture of Tian AI
Dev.to

Abliterlitics: Benchmarks and Tensor Comparison for Heretic, Abliterlix, Huiui, HauhauCS for GLM 4.7 Flash
Reddit r/LocalLLaMA

Record $1.1B Seed Funding for Reinforcement Learning Startup
AI Business

The One Substrate Failure Behind Every AI System in 2026
Reddit r/artificial

Into the Omniverse: Manufacturing’s Simulation-First Era Has Arrived
Nvidia AI Blog