ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

arXiv cs.CV / 3/27/2026


Key Points

  • ReDiPrune is a training-free, plug-and-play token pruning method for multimodal LLMs that prunes visual tokens before the vision-language projector to cut Transformer compute costs.
  • It selects informative tokens directly from vision encoder outputs using a lightweight scoring rule that balances text-conditioned relevance with max-min diversity to avoid redundancy.
  • By preserving fine-grained spatial and semantic cues (unlike post-projection pruning), ReDiPrune improves the accuracy–efficiency trade-off across multiple image and video benchmarks.
  • On EgoSchema with LLaVA-NeXT-Video-7B, keeping only 15% of visual tokens delivers a +2.0% absolute accuracy gain while reducing computation by over 6× in TFLOPs.
  • The authors provide code for inserting ReDiPrune seamlessly between the vision encoder and projector without retraining or architectural changes.

Abstract

Recent multimodal large language models are computationally expensive because their Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy–efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
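To make the selection rule concrete, the sketch below implements a greedy selector that scores each vision-encoder token by a weighted sum of text-conditioned relevance (cosine similarity to a pooled query embedding) and max-min diversity (distance to the tokens already kept). The exact weighting, the `alpha` parameter, and the use of cosine distance are assumptions for illustration; the paper's precise scoring formula may differ.

```python
# Illustrative sketch of relevance-diversity greedy token selection applied
# before the vision-language projector. The combined score
# alpha * relevance + (1 - alpha) * min-distance-to-selected
# is an assumed instantiation of the abstract's description, not the
# authors' exact rule.
import numpy as np

def select_tokens(vis, txt, keep_ratio=0.15, alpha=0.5):
    """Greedily pick visual tokens to keep.

    vis : (N, d) vision-encoder token features
    txt : (d,)   pooled text/query embedding
    Returns sorted indices of the kept tokens.
    """
    vis_n = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt_n = txt / np.linalg.norm(txt)
    relevance = vis_n @ txt_n                       # text-conditioned relevance
    k = max(1, int(round(keep_ratio * len(vis))))

    selected = [int(np.argmax(relevance))]          # seed with most relevant token
    # min cosine distance from every token to the selected set
    min_dist = 1.0 - vis_n @ vis_n[selected[0]]
    for _ in range(k - 1):
        score = alpha * relevance + (1 - alpha) * min_dist
        score[selected] = -np.inf                   # never re-pick a kept token
        nxt = int(np.argmax(score))
        selected.append(nxt)
        # update max-min diversity term with the newly selected token
        min_dist = np.minimum(min_dist, 1.0 - vis_n @ vis_n[nxt])
    return np.array(sorted(selected))
```

Because the selector runs on encoder outputs and returns plain indices, it can be dropped between any vision encoder and projector without retraining, matching the plug-and-play placement the paper describes.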