MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

arXiv cs.LG · March 26, 2026


Key Points

  • The paper reports that in MoE models, expert activation is highly skewed across layers, with a small subset of experts handling most tokens while many experts remain rarely used (“cold”).
  • It proposes MoE-Sieve, a routing-guided LoRA fine-tuning framework that profiles expert routing on a small calibration set, selects the top-k experts per layer, and applies LoRA only to those selected experts.
  • Experiments on two different MoE architectures across three tasks show that adapting only the top 25% of routed experts per layer stays competitive with full LoRA, with mean performance differences within about ±1 percentage point.
  • The approach substantially reduces compute and storage needs, cutting trainable LoRA parameters by 70–73%, adapter checkpoint sizes by 71–73%, and training time by up to 50%.
  • The authors find that adapting cold experts can increase seed-to-seed variance via gradient noise, and they show that random expert selection under the same budget performs about 2.5 percentage points worse, while greedy per-layer budget optimization does not outperform uniform top-k.
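The selection step described in the points above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's released code; the function and variable names (`select_experts`, `route_counts`) are assumptions:

```python
import numpy as np

def select_experts(route_counts, frac=0.25):
    """Pick the top-`frac` most-routed experts in each layer.

    route_counts: array of shape (num_layers, num_experts) holding token
    counts per expert, gathered by running the router over a small
    calibration set.
    """
    num_layers, num_experts = route_counts.shape
    k = max(1, int(frac * num_experts))  # uniform top-k per layer
    selected = {}
    for layer in range(num_layers):
        # indices of the k experts that received the most tokens
        top = np.argsort(route_counts[layer])[::-1][:k]
        selected[layer] = sorted(top.tolist())
    return selected

# toy example: 2 layers, 8 experts, skewed routing counts
counts = np.array([[900, 50, 10, 5, 20, 3, 8, 4],
                   [12, 700, 150, 9, 6, 80, 3, 40]])
print(select_experts(counts, frac=0.25))  # → {0: [0, 1], 1: [1, 2]}
```

LoRA adapters would then be attached only to the experts in `selected`; all other experts keep frozen weights and no adapter.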

Abstract

Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated ("cold"). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is straightforward: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within ±1 percentage point across all conditions. This reduces LoRA trainable parameters by 70–73%, adapter checkpoint size by 71–73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.
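The reported 70–73% parameter reduction is roughly what one would expect from adapting 25% of experts when most LoRA parameters live in expert modules. A back-of-the-envelope check, assuming a small illustrative share of full-LoRA parameters sits in shared (non-expert) modules that are always adapted (the `shared_share` value here is a guess, not a number from the paper):

```python
def lora_param_savings(frac_selected, shared_share=0.05):
    """Rough fraction of LoRA trainable parameters eliminated when
    adapting only `frac_selected` of the experts.

    Assumes `shared_share` of full-LoRA parameters belongs to shared
    modules that remain adapted, with the rest split evenly across
    experts. These shares are illustrative assumptions.
    """
    expert_share = 1.0 - shared_share
    kept = shared_share + expert_share * frac_selected
    return 1.0 - kept

print(f"{lora_param_savings(0.25):.0%}")  # → 71%, within the reported 70–73% range
```

Under these assumed shares, adapting the top 25% of experts removes about 71% of trainable LoRA parameters, consistent with the paper's reported range.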