MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
arXiv cs.LG / 2026-03-26
Key Points
- The paper reports that in MoE models, expert activation is highly skewed across layers, with a small subset of experts handling most tokens while many experts remain rarely used (“cold”).
- It proposes MoE-Sieve, a routing-guided LoRA fine-tuning framework that profiles expert routing on a small calibration set, selects the top-k experts per layer, and applies LoRA only to those selected experts.
- Experiments on two different MoE architectures across three tasks show that adapting only the top 25% of routed experts per layer stays competitive with full LoRA, with mean performance differences within about ±1 percentage point.
- The approach substantially reduces compute and storage needs, cutting trainable LoRA parameters by 70–73%, adapter checkpoint sizes by 71–73%, and training time by up to 50%.
- The authors find that adapting cold experts can increase seed-to-seed variance via gradient noise, and they show that random expert selection under the same budget performs worse while greedy budget optimization does not outperform uniform top-k.
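The pipeline described above (profile routing on a calibration set, keep the top-k experts per layer, attach LoRA only to those) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the toy routing log, and the LoRA parameter formula (rank × (d_model + d_ff) per adapted expert FFN matrix) are all assumptions made for the example.

```python
# Hypothetical sketch of MoE-Sieve-style expert selection (names/shapes are
# assumptions, not the paper's API).
from collections import Counter

def profile_routing(router_logs):
    """Count how many calibration tokens each expert received, per layer."""
    return {layer: Counter(expert_ids) for layer, expert_ids in router_logs.items()}

def select_top_experts(counts, num_experts, keep_frac=0.25):
    """Keep the most-routed keep_frac of experts in each layer (uniform top-k)."""
    k = max(1, int(num_experts * keep_frac))
    return {layer: [e for e, _ in c.most_common(k)] for layer, c in counts.items()}

def lora_param_count(selected, d_model=1024, d_ff=4096, rank=8):
    """Trainable LoRA params, assuming rank*(d_model + d_ff) per adapted expert."""
    per_expert = rank * (d_model + d_ff)
    return sum(len(experts) for experts in selected.values()) * per_expert

# Toy calibration log: layer -> expert id chosen by the router for each token.
# The skew mirrors the paper's observation: a few "hot" experts take most tokens.
logs = {0: [0] * 90 + [1] * 8 + [2, 3],
        1: [3] * 70 + [0] * 25 + [1] * 5}

counts = profile_routing(logs)
selected = select_top_experts(counts, num_experts=8, keep_frac=0.25)
print(selected)                    # {0: [0, 1], 1: [3, 0]}
print(lora_param_count(selected))  # 163840 trainable params for 4 adapted experts
```

Adapting 2 of 8 experts per layer here matches the paper's reported ~70-73% reduction in trainable LoRA parameters relative to adapting all experts.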



