Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

arXiv cs.LG / 4/28/2026

📰 News · Models & Research

Key Points

  • Mixture-of-Experts (MoE) models perform well on benchmarks, but supervised fine-tuning (SFT) is challenging because MoE router layers are fragile and prone to collapse.
  • Existing approaches like DenseMixer and ESFT can prevent router collapse using dense mixing or auxiliary load-balancing losses, yet they may introduce noisy gradients that hurt downstream performance.
  • Preliminary pruning experiments show that even rarely activated (long-tailed) experts contain useful, non-trivial knowledge, since removing them causes noticeable performance drops (see the activation-counting sketch after this list).
  • The paper proposes an auxiliary-loss-free MoE SFT method that uses bias-driven sparsification plus always-active gated “condenser” experts to preserve long-tailed expert information without gradient starvation.
  • Large-scale experiments indicate the proposed approach outperforms DenseMixer and ESFT, with an average improvement of 2.5%+ on mathematical reasoning and CommonsenseQA benchmarks.
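
To make the pruning observation concrete, here is a minimal sketch of how one could count per-expert activations under top-k routing and identify the long-tailed (least-activated) experts as pruning candidates. The function names, the `keep_ratio` threshold, and the toy data are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch: count per-expert activations under top-k routing and
# flag the least-activated ("long-tailed") experts as pruning candidates.
import torch

def expert_activation_counts(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) pre-softmax routing scores."""
    top_idx = router_logits.topk(top_k, dim=-1).indices            # (num_tokens, top_k)
    counts = torch.zeros(router_logits.size(-1), dtype=torch.long)
    counts.scatter_add_(0, top_idx.reshape(-1), torch.ones_like(top_idx.reshape(-1)))
    return counts

def long_tailed_expert_ids(counts: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Return indices of the least-activated experts (candidates for pruning)."""
    num_keep = int(keep_ratio * counts.numel())
    order = counts.argsort(descending=True)
    return order[num_keep:]  # the long tail

# Toy example: 8 experts, top-2 routing over 1000 tokens.
logits = torch.randn(1000, 8)
counts = expert_activation_counts(logits, top_k=2)
print(counts, long_tailed_expert_ids(counts, keep_ratio=0.75))
```

The paper's finding is that dropping the experts this kind of analysis flags as rarely used still hurts downstream accuracy, which motivates preserving rather than pruning them.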

Abstract

Despite MoE models leading many benchmarks, supervised fine-tuning (SFT) of MoE architectures remains difficult because their router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mixing or auxiliary load-balancing losses, but these introduce noisy gradients that often degrade performance. In preliminary experiments, we systematically pruned experts and observed that, while certain super experts are activated far more frequently, discarding less-used experts still leads to notable performance degradation. This suggests that even rarely activated experts encode non-trivial knowledge useful for downstream tasks. Motivated by this, we propose an auxiliary-loss-free MoE SFT framework that combines bias-driven sparsification with always-active gated condenser experts. Rather than enforcing balanced activation across all experts, our method encourages task-relevant experts to remain active while pushing long-tailed experts toward inactivity. The condenser experts provide a persistent, learnable pathway that alleviates gradient starvation and facilitates consolidation of information that would otherwise remain fragmented across sparsely activated experts. Analysis further suggests that this design better preserves long-tailed expert information under sparse routing. Experiments on large-scale MoE models demonstrate that our approach outperforms state-of-the-art SFT baselines such as DenseMixer and ESFT, achieving an average gain of more than 2.5% on both mathematical reasoning and CommonsenseQA benchmarks.
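
Below is a hedged sketch of how the two pieces described in the abstract could be wired together: a learnable per-expert routing bias used for sparsification (in place of an auxiliary load-balancing loss), and a small condenser expert that is gated but evaluated for every token. The class name, layer sizes, and gating form are assumptions for illustration; the paper's actual architecture may differ.

```python
# Minimal sketch, assuming a standard top-k MoE layer, of bias-driven
# sparsification plus an always-active gated condenser expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedSparseMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Learnable per-expert bias: shifting it steers which experts stay in
        # the top-k, instead of an auxiliary load-balancing loss.
        self.route_bias = nn.Parameter(torch.zeros(num_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # Condenser expert: always evaluated, with its own learned gate.
        self.condenser = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                       nn.Linear(d_model, d_model))
        self.condenser_gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x) + self.route_bias        # bias-shifted routing scores
        weights, idx = scores.topk(self.top_k, dim=-1)   # sparse top-k expert selection
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        # Condenser path: computed for every token, so it keeps receiving
        # gradients even when routing is highly sparse.
        gate = torch.sigmoid(self.condenser_gate(x))
        return out + gate * self.condenser(x)
```

Because the condenser path runs on every token, it never starves for gradients, which is the property the abstract credits with consolidating information that would otherwise stay fragmented across sparsely activated experts.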