Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

arXiv cs.LG / 4/30/2026

📰 NewsDeveloper Stack & InfrastructureModels & Research

共有:

Key Points

Long-context LLM serving faces rising costs from attending over growing KV caches, and while dynamic sparse attention and hierarchical (CPU+GPU) KV storage can help, system-level gains are often lost due to mismatched granularities and inefficient GPU–CPU retrieval.
The paper introduces SPIN, an inference framework that co-designs the execution pipeline with hierarchical KV storage to preserve the benefits of sparsity end-to-end.
SPIN uses a unified partition abstraction over a shared page-based KV substrate, a locality-aware KV cache manager that adapts HBM budgets and reduces PCIe round-trips, and a two-level hierarchical metadata layout tuned to the active working set.
Evaluations built on vLLM with three sparse-attention algorithms show 1.66–5.66× higher end-to-end throughput, 7–9× lower TTFT, and up to 58% lower TPOT compared with the original sparse-attention implementations.

Abstract

Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 4/30DailyView insight →

Vector DB and ANN vs PHE conflict, is there a practical workaround? [D]

Reddit r/MachineLearning

Agent Amnesia and the Case of Henry Molaison

Dev.to

Azure Weekly: Microsoft and OpenAI Restructure Partnership as GPT-5.5 Lands in Foundry

Dev.to

Proven Patterns for OpenAI Codex in 2026: Prompts, Validation, and Gateway Governance

Dev.to

Vibe coding is a tool, not a shortcut. Most people are using it wrong.

Dev.to

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

Key Points

Abstract

💡 Insights using this article

Related Articles

Vector DB and ANN vs PHE conflict, is there a practical workaround? [D]

Agent Amnesia and the Case of Henry Molaison

Azure Weekly: Microsoft and OpenAI Restructure Partnership as GPT-5.5 Lands in Foundry

Proven Patterns for OpenAI Codex in 2026: Prompts, Validation, and Gateway Governance

Vibe coding is a tool, not a shortcut. Most people are using it wrong.

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer