HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
arXiv cs.CV / 4/10/2026
Key Points
- The paper introduces HAWK, a training-free visual token pruning method for multimodal LLMs that targets the inference latency and compute overhead caused by large numbers of visual tokens.
- It is motivated by the observation that attention heads contribute unevenly to visual understanding, and combines head importance weights with text-guided attention to estimate which visual tokens are most task-relevant.
- HAWK retains crucial visual information while removing redundant tokens, and is designed to work seamlessly across different MLLMs without retraining.
- Experiments on multiple vision-language benchmarks report state-of-the-art accuracy, including results on Qwen2.5-VL where it preserves 96.0% accuracy while pruning 80.2% of visual tokens.
- The approach also reduces end-to-end latency (to 74.4% of the original) and lowers GPU memory usage, with code released on GitHub.
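The scoring idea in the bullets above can be sketched as: weight each head's text-to-visual attention by a per-head importance score, aggregate into one relevance score per visual token, and keep only the top-scoring fraction. The function below is a minimal illustration of that general recipe, not the paper's actual implementation; the array shapes, the mean over text queries, and the `keep_ratio` parameter are all assumptions for the sketch.

```python
import numpy as np

def prune_visual_tokens(attn, head_weights, keep_ratio=0.2):
    """Illustrative sketch of head-importance-weighted token pruning.

    attn:         [H, T, V] attention from T text queries to V visual
                  tokens across H heads (assumed layout, not the paper's).
    head_weights: [H] nonnegative importance weight per head.
    keep_ratio:   fraction of visual tokens to retain.
    Returns (kept_indices_in_original_order, per_token_scores).
    """
    # Average attention over text queries, then weight heads by importance.
    per_head = attn.mean(axis=1)                      # [H, V]
    scores = np.einsum('h,hv->v', head_weights, per_head)  # [V]
    # Retain the top-k scoring visual tokens, preserving sequence order.
    k = max(1, int(round(keep_ratio * attn.shape[-1])))
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return keep, scores
```

With an 80% pruning budget as reported for Qwen2.5-VL, `keep_ratio` would be about 0.2, so roughly one in five visual tokens survives.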
