MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
arXiv cs.LG, March 26, 2026
Key Points
- The paper reports that in MoE models, expert activation is highly skewed across layers, with a small subset of experts handling most tokens while many experts remain rarely used (“cold”).
- It proposes MoE-Sieve, a routing-guided LoRA fine-tuning framework that profiles expert routing on a small calibration set, selects the top-k experts per layer, and applies LoRA only to those selected experts.
- Experiments on two MoE architectures across three tasks show that adapting only the top 25% of routed experts per layer remains competitive with full LoRA, with mean performance differences within about ±1 percentage point.
- The approach substantially reduces compute and storage needs, cutting trainable LoRA parameters by 70–73%, adapter checkpoint sizes by 71–73%, and training time by up to 50%.
- The authors find that adapting cold experts can increase seed-to-seed variance through gradient noise. They also show that random expert selection under the same parameter budget performs worse, while greedy budget optimization does not outperform uniform top-k selection.
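The profile-then-select step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function names, the toy routing data, and the use of top-2 token routing are assumptions; only the idea of counting expert activations on a calibration set and keeping the most-routed fraction per layer comes from the summary.

```python
import numpy as np

def profile_routing(router_logits, tokens_per_expert=2):
    """Count how often each expert is selected over a calibration set.

    router_logits: (num_tokens, num_experts) array for one MoE layer.
    Each token is routed to its top-`tokens_per_expert` experts.
    """
    topk = np.argsort(router_logits, axis=-1)[:, -tokens_per_expert:]
    return np.bincount(topk.ravel(), minlength=router_logits.shape[1])

def select_hot_experts(counts, fraction=0.25):
    """Keep the most-frequently routed `fraction` of experts in a layer."""
    k = max(1, int(round(fraction * len(counts))))
    return sorted(np.argsort(counts)[::-1][:k].tolist())

# Toy calibration pass: 8 experts with deliberately skewed routing.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 8))
logits[:, [1, 5]] += 3.0          # experts 1 and 5 are "hot"

counts = profile_routing(logits)
hot = select_hot_experts(counts, fraction=0.25)  # 25% of 8 = 2 experts
print(hot)  # → [1, 5]
```

In a real fine-tuning run, `hot` would then gate adapter creation: LoRA matrices are attached only to the experts in this list for each layer, which is where the reported 70–73% reduction in trainable parameters comes from.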