MOONSHOT : A Framework for Multi-Objective Pruning of Vision and Large Language Models

arXiv cs.LG / 4/16/2026


Key Points

  • The paper targets post-training one-shot weight pruning, where no retraining is allowed, and points out that a single objective alone (e.g., layer-wise reconstruction loss or a second-order Taylor approximation of the training loss) is not consistently optimal across architectures and sparsity levels.
  • Building on this observation, MOONSHOT wraps existing single-objective pruning methods in a multi-objective formulation that jointly optimizes the layer-wise reconstruction error and a second-order Taylor approximation of the training loss.
  • To scale to billion-parameter models, it introduces modeling decisions and an efficient procedure for computing the inverse Hessian, aiming to preserve the efficiency of existing fast one-shot pruners.
  • On Llama-3.2 / Llama-2, it reduces C4 perplexity by up to 32.6% (at 2:4 sparsity) and improves zero-shot classification accuracy by up to +4.9 points; it also reports gains of over +5 points on ViT/ImageNet-1k and +4 points on ResNet-50 even at high sparsity (90%).
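To make the multi-objective idea concrete, here is a minimal sketch of how a per-weight score could combine the two objectives the paper names. This is an illustration under simplifying assumptions, not MOONSHOT's actual algorithm: the function name `multi_objective_prune_mask`, the `alpha` trade-off parameter, the magnitude-times-activation-norm reconstruction proxy, and the diagonal-Hessian Taylor proxy are all assumptions made for this sketch.

```python
import numpy as np

def multi_objective_prune_mask(W, X, hess_diag, sparsity, alpha=0.5):
    """Hypothetical sketch: score each weight by a convex combination of
    (a) a layer-wise reconstruction proxy and
    (b) a second-order Taylor importance,
    then keep the top-(1 - sparsity) fraction of weights.

    W         : (out, in) weight matrix of one linear layer
    X         : (n, in) calibration inputs to that layer
    hess_diag : (out, in) diagonal Hessian approximation of the training loss
    alpha     : trade-off between the two objectives (assumed parameter)
    """
    # (a) Reconstruction proxy: |w_ij| * ||x_j||, i.e. how much zeroing a
    #     weight perturbs the layer output on the calibration data.
    act_norm = np.linalg.norm(X, axis=0)            # (in,)
    recon_score = np.abs(W) * act_norm[None, :]     # (out, in)

    # (b) Second-order Taylor proxy: 0.5 * w^2 * H_ii, the estimated
    #     training-loss increase from zeroing the weight.
    taylor_score = 0.5 * W**2 * hess_diag

    # Normalize each objective to [0, 1] so the combination is scale-free.
    def norm01(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)

    score = alpha * norm01(recon_score) + (1 - alpha) * norm01(taylor_score)

    # Keep the k highest-scoring weights; the rest are pruned.
    k = int(score.size * (1 - sparsity))
    thresh = np.partition(score.ravel(), -k)[-k]
    return score >= thresh
```

In practice the combined score would be fed into an existing one-shot pruner (which is what makes the framework a wrapper), but the sketch shows the basic multi-objective scoring step.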

Abstract

Weight pruning is a common technique for compressing large neural networks. We focus on the challenging post-training one-shot setting, where a pre-trained model is compressed without any retraining. Existing one-shot pruning methods typically optimize a single objective, such as a layer-wise reconstruction loss or a second-order Taylor approximation of the training loss. We highlight that neither objective alone is consistently the most effective across architectures and sparsity levels. Motivated by this insight, we propose MOONSHOT, a general and flexible framework that extends any single-objective pruning method into a multi-objective formulation by jointly optimizing both the layer-wise reconstruction error and second-order Taylor approximation of the training loss. MOONSHOT acts as a wrapper around existing pruning algorithms. To enable this integration while maintaining scalability to billion-parameter models, we propose modeling decisions and introduce an efficient procedure for computing the inverse Hessian, preserving the efficiency of state-of-the-art one-shot pruners. When combined with state-of-the-art pruning methods on Llama-3.2 and Llama-2 models, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot mean accuracy across seven classification benchmarks by up to 4.9 points. On Vision Transformers, it improves accuracy on ImageNet-1k by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity.