StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

arXiv cs.CL / 4/9/2026


Key Points

  • StructKV is proposed as a structure-aware KV-cache compression method for million-token-plus long-context LLM inference, aiming to reduce memory/bandwidth bottlenecks without harming long-range behavior.
  • The approach identifies “global information hubs” by computing Global In-Degree Centrality across attention patterns over network depth, rather than relying on single-layer local saliency.
  • It uses Dynamic Pivot Detection with information-theoretic metrics to adaptively choose the best layer for compression, addressing cases where tokens can be globally important but locally dormant.
  • StructKV further decouples the computational budget from the memory storage budget via Structural Propagation and Decoupling, enabling scalable long-context inference.
  • Experiments on LongBench and RULER indicate improved preservation of long-range dependencies and stronger retrieval robustness compared with prior token-pruning/compression methods.
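The first bullet's "global information hubs" idea can be pictured as a graph statistic over attention maps: a token's in-degree is the total attention it receives as a key, aggregated over layers and heads. The sketch below is an illustrative reconstruction under assumed tensor shapes and a simple mean aggregation; the function name and details are not from the paper.

```python
# Hypothetical sketch of Global In-Degree Centrality: aggregate the
# attention each token receives across ALL layers and heads, then keep
# the top-k "hub" tokens. Shapes and the mean aggregation are assumptions.
import numpy as np

def global_in_degree_centrality(attn, k):
    """attn: (layers, heads, seq, seq) row-stochastic attention tensor.
    Returns the (sorted) indices of the k tokens with the highest
    attention in-degree aggregated over network depth."""
    # Column-sum of each attention matrix = total attention a key token
    # receives from all queries (its in-degree in the attention graph).
    in_degree = attn.sum(axis=2)               # (layers, heads, seq)
    centrality = in_degree.mean(axis=(0, 1))   # average over depth and heads
    keep = np.argsort(-centrality)[:k]         # top-k hub tokens
    return np.sort(keep)
```

Aggregating over depth is the point of contrast with single-layer saliency: a token that looks dormant at one layer can still score highly once attention from all layers is summed.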

Abstract

As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of the Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens using local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on saliency snapshots at a single layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations. First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection uses information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.
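The abstract's "information-theoretic metrics" for Dynamic Pivot Detection are not spelled out here, but one natural instantiation is attention entropy: score each layer by how concentrated its attention distributions are and pick the most concentrated layer as the compression pivot. The entropy criterion and the lowest-entropy rule below are illustrative assumptions, not the paper's exact metric.

```python
# Hypothetical sketch of Dynamic Pivot Detection: choose the layer whose
# attention rows have the lowest mean Shannon entropy, i.e. where attention
# is most concentrated and token saliency is presumably most reliable.
# This is one plausible information-theoretic criterion, not the paper's.
import numpy as np

def detect_pivot_layer(attn, eps=1e-12):
    """attn: (layers, heads, seq, seq) row-stochastic attention tensor.
    Returns the index of the layer with the lowest mean row entropy."""
    p = np.clip(attn, eps, 1.0)                   # avoid log(0)
    row_entropy = -(p * np.log(p)).sum(axis=-1)   # (layers, heads, seq)
    per_layer = row_entropy.mean(axis=(1, 2))     # mean over heads and queries
    return int(np.argmin(per_layer))
```

Making the pivot layer adaptive rather than fixed is what addresses the "globally important but locally dormant" failure mode: compression happens at the layer where the saliency signal is sharpest, not at an arbitrary preset depth.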