This repository provides a patch for SGLang and vLLM that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.
✅ Supported Models
Any model using a DSA indexer benefits from this patch. Via https://xcancel.com/realYushiBai/status/2032299919999189107#m
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Reddit r/LocalLLaMA / 3/14/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- IndexCache provides a patch for SGLang and vLLM to accelerate inference for models that use DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.
- The approach enables cross-layer index reuse, eliminating up to 75% of indexer computations and delivering up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality loss.
- The patch adds only a single if/else branch, uses zero additional GPU memory, and supports the models/architectures listed in the repository.
- The patch is contributed by user /u/pmttyji and is hosted on THUDM's IndexCache repository, signaling a practical tooling improvement for the community.
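The reuse mechanism described above can be illustrated with a minimal sketch: compute the sparse-attention top-k indices on periodic "anchor" layers and reuse the cached result on the layers in between, so that with a stride of 4 the indexer runs on only 1 of every 4 layers (matching the "up to 75% of indexer computations" figure). This is an illustrative approximation, not the actual IndexCache patch; the class, method, and parameter names here are hypothetical.

```python
# Hypothetical sketch of cross-layer index reuse for a DSA-style indexer.
# Not the real IndexCache API; names and the stride policy are illustrative.

class IndexCacheSketch:
    def __init__(self, stride=4):
        # With stride=4, the indexer runs on layers 0, 4, 8, ...,
        # skipping 3 of every 4 indexer calls (75%).
        self.stride = stride
        self.cached_topk = None

    def topk_indices(self, layer_idx, compute_indexer):
        # The "single if/else branch": recompute on anchor layers,
        # otherwise return the indices cached from the last anchor layer.
        # The cached result is a tensor the anchor layer produces anyway,
        # which is why the overhead in extra GPU memory can stay near zero.
        if layer_idx % self.stride == 0 or self.cached_topk is None:
            self.cached_topk = compute_indexer()
        return self.cached_topk


if __name__ == "__main__":
    calls = {"n": 0}

    def dummy_indexer():
        # Stand-in for the per-layer DSA indexer computation.
        calls["n"] += 1
        return [0, 1, 2]

    cache = IndexCacheSketch(stride=4)
    for layer in range(32):
        cache.topk_indices(layer, dummy_indexer)
    print(calls["n"])  # indexer ran on 8 of 32 layers
```

For a 32-layer model with a stride of 4, the indexer runs 8 times instead of 32, eliminating 75% of indexer computations; whether reused indices stay accurate across layers is exactly the quality trade-off the authors report as negligible.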
Related Articles
How to Enforce LLM Spend Limits Per Team Without Slowing Down Your Engineers
Dev.to
v1.82.6.rc.1
LiteLLM Releases
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA
Reduce errors and token costs in agents with semantic tool selection
Dev.to
How I Built Enterprise Monitoring Software in 6 Weeks Using Structured AI Development
Dev.to