This repository provides a patch for SGLang and vLLM that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.
✅ Supported Models
Any model using a DSA indexer benefits from this patch. Via https://xcancel.com/realYushiBai/status/2032299919999189107#m
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Reddit r/LocalLLaMA / 3/14/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- IndexCache provides a patch for SGLang and vLLM to accelerate inference for models that use DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.
- The approach enables cross-layer index reuse, eliminating up to 75% of indexer computations and delivering up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality loss.
- The patch adds only a single if/else branch, uses zero additional GPU memory, and supports the models/architectures listed in the repository.
- The patch is contributed by user /u/pmttyji and is hosted on THUDM's IndexCache repository, signaling a practical tooling improvement for the community.
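The reuse mechanism described above can be illustrated with a minimal sketch: compute the sparse-attention top-k indices on periodic "anchor" layers and reuse the cached result on the layers in between, so that with a stride of 4 the indexer runs on only 1 of every 4 layers (matching the "up to 75% of indexer computations" figure). This is an illustrative approximation, not the actual IndexCache patch; the class, method, and parameter names here are hypothetical.

```python
# Hypothetical sketch of cross-layer index reuse for a DSA-style indexer.
# Not the real IndexCache API; names and the stride policy are illustrative.

class IndexCacheSketch:
    def __init__(self, stride=4):
        # With stride=4, the indexer runs on layers 0, 4, 8, ...,
        # skipping 3 of every 4 indexer calls (75%).
        self.stride = stride
        self.cached_topk = None

    def topk_indices(self, layer_idx, compute_indexer):
        # The "single if/else branch": recompute on anchor layers,
        # otherwise return the indices cached from the last anchor layer.
        # The cached result is a tensor the anchor layer produces anyway,
        # which is why the overhead in extra GPU memory can stay near zero.
        if layer_idx % self.stride == 0 or self.cached_topk is None:
            self.cached_topk = compute_indexer()
        return self.cached_topk


if __name__ == "__main__":
    calls = {"n": 0}

    def dummy_indexer():
        # Stand-in for the per-layer DSA indexer computation.
        calls["n"] += 1
        return [0, 1, 2]

    cache = IndexCacheSketch(stride=4)
    for layer in range(32):
        cache.topk_indices(layer, dummy_indexer)
    print(calls["n"])  # indexer ran on 8 of 32 layers
```

For a 32-layer model with a stride of 4, the indexer runs 8 times instead of 32, eliminating 75% of indexer computations; whether reused indices stay accurate across layers is exactly the quality trade-off the authors report as negligible.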
Related Articles
How to Enforce LLM Spend Limits Per Team Without Slowing Down Your Engineers
Dev.to
v1.82.6.rc.1
LiteLLM Releases
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA
Reduce errors and token costs in agents with semantic tool selection
Dev.to
How I Built Enterprise Monitoring Software in 6 Weeks Using Structured AI Development
Dev.to