Does a Global Perspective Help Prune Sparse MoEs Elegantly?

arXiv cs.CL / 4/9/2026


Key Points

  • The paper introduces GRAPE, a global, redundancy-aware pruning strategy for Sparse Mixture-of-Experts (MoE) models that reallocates pruning budgets across layers based on cross-layer redundancy rather than using uniform budgets.
  • Experiments on several MoE LLMs (Mixtral variants, DeepSeek-MoE, Qwen-MoE, and GPT-OSS) show GRAPE delivers the best average performance under the same pruning budget compared with strongest local baselines.
  • For the three main models reported, GRAPE improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45% in some configurations.
  • The results suggest that MoE pruning can be made more efficient and accurate by explicitly modeling heterogeneous redundancy across the network’s layers.

Abstract

Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoE) models offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.
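The core idea of redundancy-aware budget allocation can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual GRAPE algorithm: the redundancy scores, the proportional allocation rule, and all names here (`allocate_budgets`, `min_keep`) are assumptions for exposition. In practice the paper would derive redundancy from model internals (e.g., similarity among experts), whereas a uniform baseline would prune the same number of experts in every layer.

```python
def allocate_budgets(redundancy, total_prune, experts_per_layer, min_keep=1):
    """Distribute a global pruning budget across MoE layers in proportion
    to each layer's redundancy score (higher redundancy -> prune more).

    redundancy: per-layer redundancy scores (any nonnegative scale),
                e.g. mean pairwise similarity among that layer's experts.
    total_prune: total number of experts to remove across all layers.
    experts_per_layer: experts in each layer before pruning.
    min_keep: experts that must survive in every layer.
    """
    max_prune = experts_per_layer - min_keep
    total_r = sum(redundancy)
    # Ideal (fractional) per-layer budgets, proportional to redundancy.
    raw = [r / total_r * total_prune for r in redundancy]
    # Round down, capping so every layer keeps at least min_keep experts.
    budgets = [min(int(b), max_prune) for b in raw]
    # Hand out the rounding remainder to the layers whose fractional
    # parts were largest, skipping layers already at their cap.
    remainder = total_prune - sum(budgets)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]),
                   reverse=True)
    for i in order:
        if remainder == 0:
            break
        if budgets[i] < max_prune:
            budgets[i] += 1
            remainder -= 1
    return budgets

# Hypothetical 4-layer model, 8 experts per layer, 8 experts to prune:
# the highly redundant first layer absorbs more of the budget than the
# low-redundancy third layer, unlike a uniform [2, 2, 2, 2] allocation.
print(allocate_budgets([0.9, 0.5, 0.1, 0.5], total_prune=8,
                       experts_per_layer=8))  # -> [4, 2, 0, 2]
```

The contrast with a uniform baseline is the point: both schemes remove the same total number of experts, but the global scheme spares layers where experts are already diverse.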