MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

arXiv cs.CV / 4/6/2026


Key Points

  • The paper introduces MI-Pruner, a crossmodal mutual-information-guided token pruning method for multimodal large language models (MLLMs) to improve inference efficiency.
  • Unlike existing approaches that rank visual token importance using attention scores, MI-Pruner computes mutual information directly between visual and textual feature representations before crossmodal interaction.
  • The method is designed to be simple and non-intrusive, avoiding the need for access to internal attention maps or architectural changes.
  • Experiments reported in the paper indicate MI-Pruner outperforms prior attention-based visual pruning techniques while adding minimal latency.

Abstract

For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual pruning has emerged for efficient inference. Current approaches typically measure token importance based on attention scores in the visual encoder or in the LLM decoder, then select visual tokens with high attention scores while pruning the others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between the visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature level. Our MI-Pruner is simple, efficient, and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.
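The abstract does not specify how the MI between visual and textual features is estimated, so the following is only a toy sketch of the general idea: score each visual token by an MI estimate against a pooled text feature, then keep the top-scoring tokens. The function names (`binned_mi`, `mi_prune`), the histogram-based MI estimator, the mean-pooling of text features, and the choice to treat the feature dimensions as paired samples are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def binned_mi(x, y, bins=8):
    """Crude histogram-based estimate of mutual information (in nats)
    between two 1-D arrays of paired samples. Illustrative only; the
    paper does not state which MI estimator it uses."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over x bins
    py = pxy.sum(axis=0, keepdims=True)   # marginal over y bins
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mi_prune(visual_tokens, text_tokens, keep_ratio=0.25, bins=8):
    """Score each visual token by estimated MI between its feature vector
    and the mean-pooled text feature, then keep the top-scoring tokens.
    visual_tokens: (num_visual, d); text_tokens: (num_text, d).
    Returns the kept visual-token indices in their original order."""
    text_vec = text_tokens.mean(axis=0)                      # (d,) pooled text feature
    scores = np.array([binned_mi(tok, text_vec, bins=bins)   # one score per visual token
                       for tok in visual_tokens])
    k = max(1, int(round(len(visual_tokens) * keep_ratio)))
    keep = np.argsort(scores)[-k:]                           # indices of the top-k scores
    return np.sort(keep)
```

Note that treating the `d` feature dimensions of a single token pair as i.i.d. samples is a strong simplification made here purely for a runnable demo; a faithful implementation would estimate MI over a proper sample distribution (e.g. across tokens or batches). The attraction of the approach the paper describes, however it estimates MI, is visible even in this sketch: the scoring needs only the feature tensors, not attention maps or model internals.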