This repository provides a patch for SGLang and vLLM that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.
✅ Supported Models
Any model using the DSA indexer benefits from this patch. Via https://xcancel.com/realYushiBai/status/2032299919999189107#m
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Reddit r/LocalLLaMA / 3/14/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- IndexCache provides a patch for SGLang and vLLM to accelerate inference for models that use DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.
- The approach enables cross-layer index reuse, eliminating up to 75% of indexer computations and delivering up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality loss.
- The patch requires only a single if/else branch, uses zero additional GPU memory, and supports any model or architecture that uses the DSA indexer.
- The patch was shared by Reddit user /u/pmttyji and is hosted in THUDM's IndexCache repository, signaling a practical tooling improvement for the community.
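To illustrate the key points above, here is a minimal sketch of cross-layer index reuse. This is an assumption-laden toy, not the actual IndexCache patch: the group size of 4 (which would skip 75% of indexer calls), the `IndexCache` class name, and the dot-product stand-in for the DSA indexer's scoring are all hypothetical; the real patch operates inside SGLang/vLLM attention kernels.

```python
# Hypothetical sketch of cross-layer index reuse (NOT the actual patch code).
# Assumption: layers are grouped, only the first layer of each group runs the
# sparse-attention indexer, and the remaining layers reuse its top-k indices.

GROUP_SIZE = 4  # reusing across groups of 4 layers skips 3/4 of indexer calls


class IndexCache:
    def __init__(self):
        self.cached_indices = None  # reuses one buffer instead of per-layer ones
        self.indexer_calls = 0      # counts how often the indexer actually runs

    def top_k_indices(self, layer_idx, scores, k):
        # The "single if/else branch": recompute on the first layer of each
        # group, otherwise return the indices cached by an earlier layer.
        if layer_idx % GROUP_SIZE == 0:
            self.indexer_calls += 1
            ranked = sorted(range(len(scores)),
                            key=lambda i: scores[i], reverse=True)
            self.cached_indices = ranked[:k]
        return self.cached_indices


# Usage: over 8 layers, the indexer runs only twice (layers 0 and 4).
cache = IndexCache()
for layer in range(8):
    idx = cache.top_k_indices(layer, [0.1, 0.9, 0.5, 0.3], k=2)
```

The design choice mirrored here is that attention-relevant token indices tend to be similar across adjacent layers, so recomputing them every layer is largely redundant work.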
