Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

arXiv cs.LG, April 24, 2026


Key Points

  • The paper argues that training-free visual token pruning can reduce Video LLM inference cost, but existing methods often fail on fine-grained video understanding tasks that need precise visual grounding.
  • It identifies “sink tokens” (semantically uninformative tokens that disproportionately attract attention) as a key reason pruning can cause sharp performance collapse.
  • The authors propose Sink-Token-aware Pruning (SToP), a plug-and-play method that assigns a sink score per token and uses it to suppress tokens that are likely to act as sinks.
  • Experiments show SToP improves results across multiple benchmarks (including hallucination evaluation, open-ended generation, compositional reasoning, and MCQA) and works even with aggressive pruning of up to 90% of visual tokens.
  • SToP is applied on top of existing state-of-the-art pruning approaches (VisionZip, FastVid, and Holitom), indicating it can be integrated into current efficient Video LLM pipelines without retraining.

Abstract

Video Large Language Models (Video LLMs) incur high inference latency due to the large number of visual tokens provided to the LLM. To address this, training-free visual token pruning has emerged as a way to reduce computational cost; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens--semantically uninformative tokens that attract excessive attention--as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding. Motivated by these insights, we propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method that introduces a sink score quantifying each token's tendency to behave as a sink and applies this score within existing spatial and temporal pruning methods to suppress sink tokens, thereby enhancing video understanding. To validate the effectiveness of SToP, we apply it to state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) and evaluate it across diverse benchmarks covering hallucination, open-ended generation, compositional reasoning, and MCQA. Our results demonstrate that SToP significantly boosts performance, even when pruning up to 90% of visual tokens.
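
The core idea can be illustrated with a minimal sketch. The paper does not publish its exact formula here, so the following assumes a common proxy for "sink-ness": a token that receives a lot of attention while carrying little semantic content (approximated below by a low value-vector norm). The function names (`sink_aware_prune`), the score form, and the penalty weight `alpha` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sink_aware_prune(attn_received, informativeness, keep_ratio=0.1, alpha=1.0):
    """Hypothetical sketch of sink-token-aware pruning.

    attn_received:   (N,) total attention each visual token receives.
    informativeness: (N,) proxy for semantic content, e.g. value-vector norm
                     (an assumption; the paper defines its own sink score).
    Returns indices of the tokens kept after pruning.
    """
    eps = 1e-8
    # High attention with low content -> high sink score.
    sink_score = attn_received / (informativeness + eps)
    sink_score = sink_score / (sink_score.max() + eps)  # normalize to [0, 1]
    # Penalize likely sink tokens before the usual attention-based top-k.
    importance = attn_received * (1.0 - alpha * sink_score)
    k = max(1, int(keep_ratio * len(importance)))
    return np.sort(np.argsort(importance)[-k:])

# Token 0 attracts the most attention but is nearly content-free (a "sink"):
attn = np.array([10.0, 1.0, 2.0, 3.0, 0.5])
info = np.array([0.1, 1.0, 1.0, 1.0, 1.0])
kept = sink_aware_prune(attn, info, keep_ratio=0.4)
print(kept)  # the sink token (index 0) is pruned despite its high attention
```

A plain attention-based top-k would keep token 0 first; the sink penalty instead demotes it, which matches the paper's claim that suppressing surviving sink tokens preserves the model's visual evidence under aggressive pruning.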