MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
arXiv cs.LG / 2026-03-26
Key Points
- The paper reports that in MoE models, expert activation is highly skewed across layers, with a small subset of experts handling most tokens while many experts remain rarely used (“cold”).
- It proposes MoE-Sieve, a routing-guided LoRA fine-tuning framework that profiles expert routing on a small calibration set, selects the top-k experts per layer, and applies LoRA only to those selected experts.
- Experiments on two different MoE architectures across three tasks show that adapting only the top 25% of routed experts per layer stays competitive with full LoRA, with mean performance differences within about ±1 percentage point.
- The approach substantially reduces compute and storage needs, cutting trainable LoRA parameters by 70–73%, adapter checkpoint sizes by 71–73%, and training time by up to 50%.
- The authors find that adapting cold experts can increase seed-to-seed variance via gradient noise, and they show that random expert selection under the same budget performs worse while greedy budget optimization does not outperform uniform top-k.
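The pipeline described above (profile routing on a calibration set, keep the top-k experts per layer, attach LoRA only to those) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the toy routing log, and the LoRA parameter formula (rank × (d_model + d_ff) per adapted expert FFN matrix) are all assumptions made for the example.

```python
# Hypothetical sketch of MoE-Sieve-style expert selection (names/shapes are
# assumptions, not the paper's API).
from collections import Counter

def profile_routing(router_logs):
    """Count how many calibration tokens each expert received, per layer."""
    return {layer: Counter(expert_ids) for layer, expert_ids in router_logs.items()}

def select_top_experts(counts, num_experts, keep_frac=0.25):
    """Keep the most-routed keep_frac of experts in each layer (uniform top-k)."""
    k = max(1, int(num_experts * keep_frac))
    return {layer: [e for e, _ in c.most_common(k)] for layer, c in counts.items()}

def lora_param_count(selected, d_model=1024, d_ff=4096, rank=8):
    """Trainable LoRA params, assuming rank*(d_model + d_ff) per adapted expert."""
    per_expert = rank * (d_model + d_ff)
    return sum(len(experts) for experts in selected.values()) * per_expert

# Toy calibration log: layer -> expert id chosen by the router for each token.
# The skew mirrors the paper's observation: a few "hot" experts take most tokens.
logs = {0: [0] * 90 + [1] * 8 + [2, 3],
        1: [3] * 70 + [0] * 25 + [1] * 5}

counts = profile_routing(logs)
selected = select_top_experts(counts, num_experts=8, keep_frac=0.25)
print(selected)                    # {0: [0, 1], 1: [3, 0]}
print(lora_param_count(selected))  # 163840 trainable params for 4 adapted experts
```

Adapting 2 of 8 experts per layer here matches the paper's reported ~70-73% reduction in trainable LoRA parameters relative to adapting all experts.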



