Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning

arXiv cs.LG / 4/30/2026


Key Points

  • The paper introduces DMEP (Dynamic Module-wise Expert Pruning), a LoRA-MoE fine-tuning framework that addresses inefficiency from using a fixed, uniform expert setup across different Transformer modules.
  • DMEP monitors expert usage during training and physically removes low-utility experts separately for each module, producing a smaller, module-tailored expert structure (see the sketch after this list).
  • Unlike prior approaches that keep enforcing load balancing throughout training, DMEP removes that constraint after pruning so remaining experts can specialize for the downstream task.
  • Experiments on multiple reasoning benchmarks show DMEP cuts trainable parameters by 35%–43% and improves training throughput by about 10%, while maintaining or improving downstream reasoning accuracy versus uniform LoRA-MoE.
  • Overall, the method jointly adapts expert capacity per module and reduces optimizer-state overhead, aiming to boost both parameter and training efficiency without sacrificing performance.
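
To make the per-module tracking and pruning mechanism concrete, below is a minimal PyTorch sketch of what it could look like. This is an illustration under assumptions, not the authors' implementation: the `LoRAMoELayer` class, its dimensions, the top-k routing, and the `prune_low_utility_experts` heuristic (keep the most-used experts, drop the rest) are all hypothetical.

```python
import torch
import torch.nn as nn

class LoRAMoELayer(nn.Module):
    """One LoRA-MoE adapter attached to a single Transformer module.

    Illustrative only: tracks how often the router selects each expert so
    that low-utility experts can later be pruned for this module alone.
    """

    def __init__(self, in_dim, out_dim, num_experts=8, rank=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(in_dim, num_experts, bias=False)
        # Each expert is a standard low-rank LoRA pair (A, B).
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(in_dim, rank) * 0.01) for _ in range(num_experts)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(rank, out_dim)) for _ in range(num_experts)]
        )
        # Running count of tokens routed to each expert (the utility signal).
        self.register_buffer("usage", torch.zeros(num_experts))

    def forward(self, x):  # x: (tokens, in_dim)
        logits = self.router(x)                               # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        with torch.no_grad():                                 # track utilization
            self.usage += torch.bincount(
                idx.flatten(), minlength=len(self.lora_A)
            ).float()
        out = torch.zeros(x.shape[0], self.lora_B[0].shape[1],
                          dtype=x.dtype, device=x.device)
        for slot in range(self.top_k):
            for e in range(len(self.lora_A)):
                mask = idx[:, slot] == e
                if mask.any():
                    delta = x[mask] @ self.lora_A[e] @ self.lora_B[e]
                    out[mask] += weights[mask, slot].unsqueeze(-1) * delta
        return out

    @torch.no_grad()
    def prune_low_utility_experts(self, keep_ratio=0.5):
        """Physically drop the least-used experts for this module only."""
        k = max(1, int(len(self.lora_A) * keep_ratio))
        keep = self.usage.topk(k).indices.sort().values.tolist()
        self.lora_A = nn.ParameterList([self.lora_A[i] for i in keep])
        self.lora_B = nn.ParameterList([self.lora_B[i] for i in keep])
        new_router = nn.Linear(self.router.in_features, k, bias=False)
        new_router.weight.copy_(self.router.weight[keep])     # keep learned routing rows
        self.router = new_router.to(self.router.weight.device)
        self.usage = torch.zeros(k, device=self.usage.device)
```

Because each module keeps its own `usage` counter and prunes independently, different modules can end up with different expert counts; in practice the optimizer would be rebuilt over the surviving parameters, which is where the optimizer-state savings described above would materialize.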

Abstract

LoRA-MoE has emerged as an effective paradigm for parameter-efficient fine-tuning, combining the low training cost of LoRA with the increased adaptation capacity of Mixture-of-Experts (MoE). However, existing LoRA-MoE frameworks typically adopt a fixed and uniform expert configuration across heterogeneous Transformer modules (e.g., attention query/key projections and MLP gating networks), ignoring their distinct functional roles and capacity requirements. This design leads to localized over-provisioning, redundant trainable parameters, and unnecessary optimizer-state overhead. Moreover, prior methods enforce load balancing among experts throughout training. Although beneficial in the early stage, this constraint becomes restrictive once routing patterns stabilize, limiting expert specialization on downstream tasks. In this paper, we propose DMEP, a novel LoRA-MoE fine-tuning framework based on Dynamic Module-wise Expert Pruning. DMEP tracks expert utilization during training and physically removes low-utility experts on a per-module basis, yielding a more compact expert structure tailored to different modules. The pruned model then continues training without the load-balancing constraint, freeing the remaining experts to focus entirely on the downstream task and develop specialized expertise. By jointly adapting module-wise expert capacity and eliminating unnecessary balancing, DMEP improves both parameter efficiency and training efficiency. Extensive experiments on multiple reasoning benchmarks show that DMEP reduces trainable parameters by 35%–43% and improves training throughput by about 10%, while maintaining or surpassing the downstream reasoning accuracy of uniform LoRA-MoE baselines.
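
The two-phase objective described in the abstract (auxiliary load balancing only before pruning, plain task loss afterward) might be wired up as follows. The `training_step` signature, the single `prune_step` schedule, and the assumption that the model returns router logits alongside the task loss are illustrative choices, and the code reuses the hypothetical `LoRAMoELayer` from the earlier sketch.

```python
def load_balancing_loss(router_logits):
    """Switch-style auxiliary loss: num_experts * sum_i(token_fraction_i * mean_prob_i)."""
    probs = router_logits.softmax(dim=-1)                  # (tokens, num_experts)
    num_experts = probs.shape[-1]
    token_frac = probs.argmax(dim=-1).bincount(minlength=num_experts).float()
    token_frac = token_frac / token_frac.sum()
    mean_prob = probs.mean(dim=0)
    return num_experts * (token_frac * mean_prob).sum()


def training_step(model, batch, step, prune_step, lb_coeff=0.01, keep_ratio=0.5):
    """Two-phase schedule: balance experts early, prune once, then let the
    surviving experts specialize on the task loss alone."""
    if step == prune_step:
        # One-shot, per-module pruning based on the usage gathered so far
        # (LoRAMoELayer is the hypothetical class from the previous sketch).
        for module in model.modules():
            if isinstance(module, LoRAMoELayer):
                module.prune_low_utility_experts(keep_ratio)
    task_loss, router_logits = model(**batch)              # assumed model interface
    if step < prune_step:
        return task_loss + lb_coeff * load_balancing_loss(router_logits)
    return task_loss                                       # no balancing after pruning
```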