IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

arXiv cs.CL / 3/13/2026

Key Points

  • IndexCache targets the remaining bottleneck in sparse attention: while DSA's top-k selection already reduces core attention from O(L^2) to O(Lk), the indexer itself still costs O(L^2) at every layer, and IndexCache eliminates most of that cost by reusing top-k indices across layers.
  • It partitions layers into a small set of Full layers with indexers and many Shared layers that reuse the nearest Full layer's top-k indices.
  • Training-free IndexCache uses a greedy search that selects which layers retain indexers by directly minimizing language-modeling loss on a calibration set, with no weight updates.
  • Training-aware IndexCache introduces a multi-layer distillation loss to train retained indexers against the averaged attention distributions of the layers they serve.
  • Experiments on a 30B DeepSeek Sparse Attention (DSA) model show up to 75% of indexer computations removed and speedups of up to 1.82x on prefill and 1.48x on decoding, with preliminary evidence on GLM-5.
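The Full/Shared partition described above can be illustrated with a minimal sketch. The function names, the layer layout, and the caching scheme here are our own illustration, not the paper's implementation: Full layers run an indexer and cache its top-k indices, while Shared layers simply look up the nearest preceding Full layer's result.

```python
# Illustrative sketch of cross-layer index reuse (names are hypothetical):
# "Full" layers run their own top-k indexer and cache the result; "Shared"
# layers reuse the indices cached by the nearest preceding Full layer.

def nearest_full_layer(layer: int, full_layers: list[int]) -> int:
    """Return the closest Full layer at or before `layer`."""
    return max(l for l in full_layers if l <= layer)

def select_topk_indices(layer, full_layers, index_cache, run_indexer, k):
    """Full layers compute and cache top-k indices; Shared layers reuse them."""
    if layer in full_layers:
        index_cache[layer] = run_indexer(layer, k)  # only Full layers pay indexer cost
    return index_cache[nearest_full_layer(layer, full_layers)]
```

With, say, 3 Full layers out of 12, the indexer runs on only a quarter of the layers, which is the kind of 75% reduction in indexer computation the paper reports.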

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
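The training-free greedy search can be sketched as follows. This is a simplified reconstruction under our own assumptions: `eval_loss` stands in for running the model with a given Full-layer configuration on the calibration set, and the choice to always keep layer 0 (which has no earlier layer to reuse from) is our inference, not a detail stated in the summary.

```python
# Hypothetical sketch of the greedy Full-layer search: grow the set of
# layers that retain indexers by repeatedly adding whichever candidate
# layer most reduces calibration language-modeling loss.

def greedy_select_full_layers(num_layers, budget, eval_loss):
    """Return a sorted list of `budget` layers that keep their indexers.

    `eval_loss(full_layers)` is assumed to return the calibration LM loss
    when only `full_layers` run indexers and all others reuse indices.
    """
    full_layers = {0}  # layer 0 must keep its indexer: nothing earlier to reuse
    while len(full_layers) < budget:
        candidates = [l for l in range(num_layers) if l not in full_layers]
        # pick the candidate whose inclusion yields the lowest loss
        best = min(candidates, key=lambda l: eval_loss(sorted(full_layers | {l})))
        full_layers.add(best)
    return sorted(full_layers)
```

Since no weights are updated, each greedy step only requires forward passes over the calibration set, one per candidate configuration.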