Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

arXiv cs.CV / 4/16/2026


Key Points

  • The paper addresses a limitation of Sparse Upcycling for Mixture-of-Experts (MoE) models: because all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and weak early specialization.
  • It introduces Cluster-aware Upcycling by clustering dense-model input activations semantically, initializing each expert from its cluster’s subspace via truncated SVD, and initializing router weights using cluster centroids.
  • To improve training stability and routing quality, the authors add an expert-ensemble self-distillation loss that uses an ensemble teacher to provide reliable routing guidance.
  • Experiments on CLIP ViT-B/32 and ViT-B/16 show consistent gains over prior methods on both zero-shot and few-shot benchmarks, alongside more diverse and disentangled expert representations.
  • The approach is reported to reduce inter-expert similarity and produce more confident routing behavior, suggesting better utilization of specialized experts early in training.
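The initialization described above can be sketched in NumPy. The paper does not give exact equations here, so the details below are assumptions: cluster assignments come from a plain k-means, each expert is formed by projecting the dense weight onto its cluster's top-`rank` SVD subspace, and the router weight matrix is simply the stack of cluster centroids. Function names (`cluster_aware_upcycle`, `kmeans`) are illustrative, not from the paper.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means over activation vectors X (n, d)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # assign each activation to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centroids[j] = pts.mean(0)
    return centroids, labels

def cluster_aware_upcycle(W_dense, X, n_experts, rank):
    """Initialize MoE expert weights and a router from a dense layer.

    W_dense: (d_out, d_in) pretrained dense weight.
    X:       (n, d_in) input activations collected from the dense model.
    Returns one weight per expert plus router weights (n_experts, d_in).
    """
    centroids, labels = kmeans(X, n_experts)
    experts = []
    for j in range(n_experts):
        Xc = X[labels == j]
        if len(Xc) < rank:
            # fallback to plain upcycling for tiny clusters (our choice)
            experts.append(W_dense.copy())
            continue
        # truncated SVD of centered cluster activations -> top-`rank` basis
        _, _, Vt = np.linalg.svd(Xc - Xc.mean(0), full_matrices=False)
        V = Vt[:rank].T                       # (d_in, rank)
        # project the dense weight onto the cluster subspace (illustrative)
        experts.append(W_dense @ V @ V.T)
    # routing by similarity to centroids breaks the symmetric start
    router = centroids
    return experts, router
```

Because each expert sees a different subspace projection, the experts differ from step zero, which is the symmetry-breaking effect the paper attributes to cluster-aware initialization.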

Abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router's initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.
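The expert-ensemble self-distillation loss is only described at a high level, so the following is a hedged sketch of one plausible reading: the teacher is the uniform ensemble of all experts' predictions (treated as a fixed target), the student is the router-weighted mixture, and the loss is the KL divergence between them. The function name and the exact KL direction are assumptions, not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_self_distillation_loss(expert_logits, router_logits):
    """KL from an ensemble teacher to the routed student (a sketch).

    expert_logits: (E, B, C) per-expert class logits.
    router_logits: (B, E) router scores per example.
    """
    gates = softmax(router_logits)                       # (B, E)
    probs = softmax(expert_logits)                       # (E, B, C)
    # student: router-weighted mixture of expert predictions
    student = np.einsum('be,ebc->bc', gates, probs)      # (B, C)
    # teacher: uniform ensemble over experts, kept as a stable target
    teacher = probs.mean(0)                              # (B, C)
    eps = 1e-9
    # KL(teacher || student), averaged over the batch
    kl = (teacher * (np.log(teacher + eps) - np.log(student + eps))).sum(-1)
    return float(kl.mean())
```

If all experts agree, the routed mixture equals the ensemble and the loss vanishes; early in training, when the randomly routed student is unreliable, the ensemble teacher supplies the more stable signal the abstract refers to.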
