Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

arXiv cs.CV / 3/26/2026


Key Points

  • The paper introduces AttentionPack, an adaptive, attention-aware decoding optimization framework aimed at reducing memory overhead in large vision-language model (VLM) inference for long visual/text sequences.
  • It proposes multi-head attention compaction that leverages an implicit low-rank structure to store key/value matrices more economically during decoding.
  • It also adds token-specific attention-aware decompression to lower latency costs while maintaining output quality.
  • Experiments across multiple benchmarks show up to 8× memory-efficiency improvements, enabling larger batch sizes and faster batch inference, or supporting longer contexts for better retrieval.
  • The authors further report additional efficiency gains when AttentionPack is combined with eviction, quantization, and kernel fusion for resource-limited environments.
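The paper itself does not publish its algorithm here, but the key idea behind the compaction step can be illustrated with a minimal sketch: if a head's cached key (or value) matrix is approximately low-rank, a truncated SVD lets us store two small factors instead of the full matrix. All shapes, names, and the choice of rank below are illustrative assumptions, not AttentionPack's actual design.

```python
import numpy as np

def compress_kv(K, rank):
    """Return low-rank factors (A, B) with A @ B approximating K (T x d)."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (T x r) per-token coefficients
    B = Vt[:rank, :]             # (r x d) shared basis
    return A, B

def decompress_kv(A, B):
    """Reconstruct the approximate key/value matrix."""
    return A @ B

# Synthetic cache of 512 tokens with an explicit low-rank structure
# plus small noise, standing in for a real per-head KV cache.
rng = np.random.default_rng(0)
T, d, r = 512, 128, 16
K = rng.standard_normal((T, r)) @ rng.standard_normal((r, d))
K += 0.01 * rng.standard_normal((T, d))

A, B = compress_kv(K, rank=r)
ratio = K.size / (A.size + B.size)
err = np.linalg.norm(K - decompress_kv(A, B)) / np.linalg.norm(K)
print(f"compression ratio {ratio:.1f}x, relative error {err:.4f}")
```

With these shapes the factors take roughly a sixth of the memory of the full matrix; real savings depend on how low-rank the cached attention states actually are.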

Abstract

Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge due to memory overhead during decoding, especially when the query and answer consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive, attention-aware optimization framework tailored to large vision-language models that improves memory efficiency during decoding, addressing the challenges posed by the high number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two respects: (i) a multi-head attention compaction method that stores key and value matrices economically by exploiting their implicit low-rank structure, and (ii) a token-specific attention-aware decompression mechanism that reduces latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8×, enabling larger batch sizes and faster batch inference while preserving output quality, or longer context lengths for superior retrieval performance. We also show that AttentionPack combines effectively with eviction, quantization, and kernel fusion, yielding further efficiency gains in resource-limited environments.
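To make the "token-specific attention-aware decompression" idea concrete, here is a hypothetical single-head sketch: attention scores are computed directly against the compressed key factors, and full values are reconstructed only for the tokens carrying the most attention mass. The factor layout and the top-k selection policy are assumptions for illustration, not the paper's published mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_compressed(q, A_k, B_k, A_v, B_v, top_k):
    """Single-head attention over a low-rank-compressed KV cache.

    q: (d,) query; A_k/A_v: (T, r) per-token factors; B_k/B_v: (r, d) bases.
    """
    # Scores without materializing the full key matrix: q @ (A_k B_k)^T.
    scores = (q @ B_k.T) @ A_k.T                  # (T,)
    w = softmax(scores / np.sqrt(q.shape[0]))
    idx = np.argsort(w)[-top_k:]                  # highest-attention tokens
    V_sel = A_v[idx] @ B_v                        # decompress only these rows
    return (w[idx] @ V_sel) / w[idx].sum()        # renormalized weighted sum

# Demo on a synthetic compressed cache of 256 tokens.
rng = np.random.default_rng(1)
T, d, r = 256, 64, 8
A_k, B_k = rng.standard_normal((T, r)), rng.standard_normal((r, d))
A_v, B_v = rng.standard_normal((T, r)), rng.standard_normal((r, d))
q = rng.standard_normal(d)
out = attend_compressed(q, A_k, B_k, A_v, B_v, top_k=32)
print(out.shape)  # (64,)
```

Decompressing only the selected rows is what would keep latency low: the per-step cost scales with the number of high-attention tokens rather than the full context length.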