IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

arXiv cs.CV / 4/2/2026


Key Points

  • The paper analyzes attention in large vision-language models through a “dual form” perspective, viewing attention as an implicit linear layer built from per-token rank-1 updates derived from each token’s key/value pairs.
  • It formulates token pruning as selecting an optimal subset of these rank-1 updates to best approximate the original attention weight matrix, enabling a training-free pruning framework.
  • For softmax attention in LVLMs, the authors derive a new pruning metric that jointly considers a token’s information magnitude and how much information it duplicates with other tokens.
  • To select tokens efficiently using the new metric, the method introduces “Progressive Chunked Maximal Marginal Relevance,” aiming to improve the performance–efficiency tradeoff.
  • Experiments reportedly show the approach achieves better performance versus computation reduction than prior pruning methods, while also providing an interpretive lens on existing techniques.
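The "dual form" view in the first two points can be made concrete with a small numerical sketch. This is not the paper's code; it illustrates, for simplified linear (non-softmax) attention, how the output equals a matrix–vector product with an implicit weight matrix built from per-token rank-1 outer products, and how pruning tokens amounts to dropping terms from that sum.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 16                       # head dimension, number of visual tokens
K = rng.normal(size=(n, d))        # one key per token
V = rng.normal(size=(n, d))        # one value per token
q = rng.normal(size=d)             # a single query

# Linear attention output: o = sum_i (q . k_i) v_i
o_direct = sum((q @ K[i]) * V[i] for i in range(n))

# Dual form: the same output via an implicit linear layer whose
# weight matrix is a sum of rank-1 outer products W = sum_i v_i k_i^T.
W = sum(np.outer(V[i], K[i]) for i in range(n))
o_dual = W @ q
assert np.allclose(o_direct, o_dual)

# Token pruning = keeping a subset S of the rank-1 updates:
# W_S = sum_{i in S} v_i k_i^T approximates W, and the goal is to
# choose S minimizing the approximation error.
S = [0, 2, 5, 7]
W_S = sum(np.outer(V[i], K[i]) for i in S)
err = np.linalg.norm(W - W_S)      # error introduced by pruning
```

Softmax attention normalizes the per-token coefficients, which is why the paper derives a modified metric rather than using this linear form directly.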

Abstract

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing another perspective on existing pruning approaches.
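The abstract does not detail the "Progressive Chunked" variant, but the underlying Maximal Marginal Relevance idea, greedily balancing an item's individual score against its redundancy with already-selected items, can be sketched as follows. The magnitude proxy (key norms) and the similarity measure here are illustrative assumptions, not the paper's actual metric.

```python
import numpy as np

def mmr_select(scores, sim, k, lam=0.5):
    """Greedy MMR: pick k tokens, trading off individual score
    (information magnitude) against max similarity to tokens
    already selected (information duplication)."""
    selected, candidates = [], set(range(len(scores)))
    while len(selected) < k and candidates:
        best, best_val = None, -np.inf
        for i in candidates:
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            val = lam * scores[i] - (1 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(1)
n, d = 32, 8
K = rng.normal(size=(n, d))                      # per-token keys
scores = np.linalg.norm(K, axis=1)               # assumed magnitude proxy
Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
sim = Kn @ Kn.T                                  # cosine similarity matrix
keep = mmr_select(scores, sim, k=8)              # indices of retained tokens
```

The naive greedy loop above is O(n·k); a "progressive chunked" scheme presumably amortizes this cost over chunks of candidates, which the summary does not specify.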
