Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

arXiv cs.LG, April 24, 2026


Key Points

  • The paper argues that training-free visual token pruning can reduce Video LLM inference cost, but existing methods often fail on fine-grained video understanding tasks that need precise visual grounding.
  • It identifies “sink tokens” (semantically uninformative tokens that disproportionately attract attention) as a key reason pruning can cause sharp performance collapse.
  • The authors propose Sink-Token-aware Pruning (SToP), a plug-and-play method that assigns a sink score per token and uses it to suppress tokens that are likely to act as sinks.
  • Experiments show SToP improves results across multiple benchmarks (including hallucination evaluation, open-ended generation, compositional reasoning, and MCQA) and works even with aggressive pruning of up to 90% of visual tokens.
  • SToP is applied on top of existing state-of-the-art pruning approaches (VisionZip, FastVid, and Holitom), indicating it can be integrated into current efficient Video LLM pipelines without retraining.

Abstract

Video Large Language Models (Video LLMs) incur high inference latency due to the large number of visual tokens provided to the LLM. To address this, training-free visual token pruning has emerged as a way to reduce computational cost; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens--semantically uninformative tokens that attract excessive attention--as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding. Motivated by these insights, we propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method that introduces a sink score quantifying each token's tendency to behave as a sink and applies this score within existing spatial and temporal pruning methods to suppress sink tokens, thereby enhancing video understanding. To validate the effectiveness of SToP, we apply it to state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) and evaluate it across diverse benchmarks covering hallucination, open-ended generation, compositional reasoning, and MCQA. Our results demonstrate that SToP significantly boosts performance, even when pruning up to 90% of visual tokens.
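
The core idea can be illustrated with a minimal sketch. The paper does not publish its exact formula here, so the following assumes a common proxy for "sink-ness": a token that receives a lot of attention while carrying little semantic content (approximated below by a low value-vector norm). The function names (`sink_aware_prune`), the score form, and the penalty weight `alpha` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sink_aware_prune(attn_received, informativeness, keep_ratio=0.1, alpha=1.0):
    """Hypothetical sketch of sink-token-aware pruning.

    attn_received:   (N,) total attention each visual token receives.
    informativeness: (N,) proxy for semantic content, e.g. value-vector norm
                     (an assumption; the paper defines its own sink score).
    Returns indices of the tokens kept after pruning.
    """
    eps = 1e-8
    # High attention with low content -> high sink score.
    sink_score = attn_received / (informativeness + eps)
    sink_score = sink_score / (sink_score.max() + eps)  # normalize to [0, 1]
    # Penalize likely sink tokens before the usual attention-based top-k.
    importance = attn_received * (1.0 - alpha * sink_score)
    k = max(1, int(keep_ratio * len(importance)))
    return np.sort(np.argsort(importance)[-k:])

# Token 0 attracts the most attention but is nearly content-free (a "sink"):
attn = np.array([10.0, 1.0, 2.0, 3.0, 0.5])
info = np.array([0.1, 1.0, 1.0, 1.0, 1.0])
kept = sink_aware_prune(attn, info, keep_ratio=0.4)
print(kept)  # the sink token (index 0) is pruned despite its high attention
```

A plain attention-based top-k would keep token 0 first; the sink penalty instead demotes it, which matches the paper's claim that suppressing surviving sink tokens preserves the model's visual evidence under aggressive pruning.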