Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
arXiv cs.CV / 4/16/2026
Key Points
- The paper addresses a limitation of Sparse Upcycling for Mixture-of-Experts (MoE) models, where identical expert initial weights and a randomly initialized router lead to expert symmetry and weak early specialization.
- It introduces Cluster-aware Upcycling: dense-model input activations are clustered semantically, each expert is initialized from its cluster's subspace via truncated SVD, and the router weights are initialized from the cluster centroids (a minimal initialization sketch follows this list).
- To improve training stability and routing quality, the authors add an expert-ensemble self-distillation loss in which an ensemble teacher provides reliable routing guidance (a loss sketch also appears after the list).
- Experiments on CLIP ViT-B/32 and ViT-B/16 show consistent gains over prior methods on both zero-shot and few-shot benchmarks, alongside more diverse and disentangled expert representations.
- The approach is reported to reduce inter-expert similarity and produce more confident routing behavior, suggesting better utilization of specialized experts early in training.
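The sketch below illustrates how a cluster-aware initialization of this kind could look, assuming a dense FFN up-projection weight, k-means clustering of collected token activations, and a subspace projection of the dense weight per cluster. The function name, `num_experts`, `rank`, and the specific projection are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of cluster-aware expert/router initialization (illustrative, not
# the paper's exact procedure). Assumes a dense FFN weight of shape (d_model, d_ff)
# and token activations collected from the dense model.
import numpy as np
from sklearn.cluster import KMeans

def cluster_aware_init(activations, w_dense, num_experts=8, rank=64):
    """activations: (n_tokens, d_model) dense-model input activations
    w_dense:     (d_model, d_ff) dense FFN weight to be upcycled
    """
    # 1) Cluster token activations, one cluster per expert.
    km = KMeans(n_clusters=num_experts, n_init=10).fit(activations)
    centroids = km.cluster_centers_                      # (num_experts, d_model)

    expert_weights = []
    for e in range(num_experts):
        # 2) Truncated SVD of the cluster's activations gives a low-rank basis
        #    of that cluster's input subspace (rank should not exceed cluster size).
        x_e = activations[km.labels_ == e]               # (n_e, d_model)
        _, _, vt = np.linalg.svd(x_e, full_matrices=False)
        basis = vt[:rank].T                              # (d_model, rank)
        # 3) Project the dense weight onto the cluster subspace so each expert
        #    starts specialized to its cluster instead of being an exact copy.
        w_e = basis @ (basis.T @ w_dense)                # (d_model, d_ff)
        expert_weights.append(w_e)

    # 4) Router rows are the normalized cluster centroids, so tokens are routed
    #    toward the expert built from their own cluster from the first step.
    router = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return expert_weights, router
```

Initializing the router from centroids rather than at random is what breaks the expert symmetry of plain Sparse Upcycling: tokens already score highest against the expert derived from their own activation cluster.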
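The second sketch shows one plausible form of an expert-ensemble self-distillation loss, assuming the teacher is the uniform average of all expert outputs and the student is the router-weighted output; the paper's exact loss and weighting may differ.

```python
# Minimal sketch of an expert-ensemble self-distillation loss (assumed form).
import torch
import torch.nn.functional as F

def ensemble_self_distillation_loss(expert_outputs, router_probs):
    """expert_outputs: (num_experts, batch, d_model) per-expert outputs for the same tokens
    router_probs:   (batch, num_experts) softmax routing weights
    """
    # Teacher: uniform ensemble over all experts, with no gradient flowing back.
    teacher = expert_outputs.mean(dim=0).detach()                  # (batch, d_model)
    # Student: router-weighted combination, as used by the sparse model itself.
    student = torch.einsum('be,ebd->bd', router_probs, expert_outputs)
    # Pull the routed output toward the more reliable ensemble prediction.
    return F.mse_loss(student, teacher)
```

Distilling from the ensemble gives the router a stable target while the experts are still barely differentiated, which is consistent with the reported gains in routing confidence early in training.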