Make Your LVLM KV Cache More Lightweight

arXiv cs.CV / 5/4/2026

Key Points

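The KV cache is a key component for efficient inference, but in LVLMs the large number of vision tokens processed during prefill places a heavy load on GPU memory.
The proposed method, LightKV, exploits redundancy among vision-token embeddings: guided by the text prompt, it uses cross-modality message passing to aggregate information across vision tokens and compresses them progressively during prefill.
Unlike prior methods that compress using vision information alone, LightKV features "prompt-aware guidance" that steers compression according to the prompt.
Evaluated on eight open-source LVLMs and eight public benchmarks (including MME and SeedBench), LightKV retains only 55% of the original vision tokens yet halves the vision-token KV cache, cuts computation by up to 40%, preserves general-purpose performance, and significantly outperforms existing baselines.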
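
To make the memory claim concrete, the following is a rough back-of-the-envelope sketch of how vision-token KV cache size scales with the number of retained tokens. Everything numeric here is an assumption for illustration and does not come from the paper: the model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 576 vision tokens, and the uniform per-layer retention schedule. LightKV compresses progressively during prefill, so its actual per-layer token counts may differ from this uniform schedule.

```python
def kv_cache_bytes(tokens_per_layer, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values, given the token count at each layer."""
    # Factor 2: one key vector and one value vector per token per KV head.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_elem
    return sum(n * per_token for n in tokens_per_layer)


# Hypothetical model shape and image token count (assumptions, not from the paper).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 32, 128
VISION_TOKENS = 576

# Baseline: every layer caches every vision token.
full = kv_cache_bytes([VISION_TOKENS] * NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

# Assumed schedule: every layer keeps 55% of the vision tokens.
kept = VISION_TOKENS * 55 // 100
compressed = kv_cache_bytes([kept] * NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

print(f"full: {full / 2**20:.1f} MiB, compressed: {compressed / 2**20:.1f} MiB")
print(f"ratio: {compressed / full:.3f}")
```

Because the function takes a per-layer list, a progressive schedule can be tried by passing different token counts for early and late layers.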

Abstract

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
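
The abstract does not spell out the message-passing procedure, so the following is only a minimal sketch of the general idea of prompt-guided token merging, not LightKV's algorithm. It assumes a single compression step: vision tokens are ranked by cosine similarity to the mean prompt embedding, the top `keep` tokens are retained, and each remaining token is averaged into its most similar retained token. The function name `merge_vision_tokens` and all of its parameters are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_vision_tokens(vision, prompt, keep):
    """Illustrative prompt-guided merge (an assumption, not the paper's method).

    Keeps the `keep` vision tokens most similar to the prompt and folds every
    other token into its most similar kept token by averaging.
    """
    dim = len(prompt[0])
    # The mean prompt embedding acts as the text-side guidance signal.
    guide = [sum(tok[d] for tok in prompt) / len(prompt) for d in range(dim)]
    ranked = sorted(range(len(vision)),
                    key=lambda i: cosine(vision[i], guide), reverse=True)
    anchors = sorted(ranked[:keep])  # retained tokens, in original order
    groups = {a: [vision[a]] for a in anchors}
    for i in ranked[keep:]:
        # Send each dropped token's information to its nearest retained token.
        nearest = max(anchors, key=lambda a: cosine(vision[i], vision[a]))
        groups[nearest].append(vision[i])
    # Each output token is the mean of its group.
    return [[sum(t[d] for t in groups[a]) / len(groups[a]) for d in range(dim)]
            for a in anchors]


# Toy 2-D embeddings: four vision tokens, one prompt token.
vision = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prompt = [[1.0, 0.2]]
merged = merge_vision_tokens(vision, prompt, keep=2)
print(len(vision), "->", len(merged), "vision tokens")
```

A progressive variant would apply a step like this at several layers during prefill, so that later layers cache fewer vision tokens than earlier ones; where and how often to compress is left open here because the source does not specify it.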