When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

arXiv cs.CV / 4/7/2026


Key Points

  • The paper studies “attention sinks” in Large Vision-Language Models (LVLMs), defining them as tokens that attract disproportionate attention and examining how this behavior transfers across modalities.
  • It categorizes visual attention sinks into two types—ViT-emerged sinks (V-sinks) originating from the vision encoder and LLM-emerged sinks (L-sinks) arising within the deep LLM layers.
  • The analysis finds a performance trade-off: sinks can help by encoding global scene-level priors, but excessive dominance can suppress fine-grained visual evidence needed for local perception.
  • The authors identify which functional layers most strongly affect downstream performance when sinks are modulated and propose Layer-wise Sink Gating (LSG).
  • LSG is a lightweight, plug-and-play module trained with standard next-token prediction while freezing the LVLM backbone, and it improves performance on multimodal benchmarks by balancing global reasoning with local visual precision.
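The gating idea described above can be sketched minimally as follows. This is an illustration only, not the authors' implementation: the function name `layerwise_sink_gating` and the per-layer scalar-gate parameterization are assumptions, and the real LSG learns its gates via next-token prediction with the LVLM backbone frozen.

```python
import numpy as np

def layerwise_sink_gating(attn, sink_mask, gate_sink, gate_rest):
    """Rescale attention weights for sink vs. non-sink visual tokens,
    then renormalize each query row so it still sums to 1.

    attn:      (num_queries, num_keys) softmax attention weights, one layer
    sink_mask: (num_keys,) boolean, True where the key is a visual sink token
    gate_sink, gate_rest: per-layer scalar gates (learned in the real module;
                          plain floats here for illustration)
    """
    scale = np.where(sink_mask, gate_sink, gate_rest)   # per-key scale factor
    gated = attn * scale                                # damp or boost sinks
    return gated / gated.sum(axis=-1, keepdims=True)    # renormalize rows
```

With `gate_sink < gate_rest`, attention mass is shifted away from sink tokens toward the remaining visual tokens, which is one way to trade global scene priors for fine-grained local evidence on a per-layer basis.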

Abstract

Attention sinks are defined as tokens that attract disproportionate attention. While these have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLMs) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first divides visual sinks into two distinct categories: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on this new definition, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sinks and the remaining visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. In most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
