IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
arXiv cs.LG / April 14, 2026
Key Points
- IceCache is a new KV-cache management approach for long-sequence LLM inference that targets the linear memory growth bottleneck on limited hardware.
- It combines semantic token clustering with PagedAttention, using a hierarchical, dynamically updatable structure that keeps semantically related tokens in contiguous memory regions, making token selection and CPU↔GPU transfer more efficient.
- Experiments on LongBench show that with a 256-token budget, IceCache preserves about 99% of the accuracy of a full KV-cache baseline.
- Compared with other KV-cache offloading methods, IceCache achieves competitive or better latency and accuracy while using only ~25% of their KV-cache token budget, with the largest gains on long-generation tasks.
- An implementation is publicly available at the project site for reproducing and building on the technique.
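
The paper does not spell out its clustering algorithm here, but the core idea in the bullets above can be sketched: cluster cached key vectors so semantically related tokens land in contiguous fixed-size pages (as in PagedAttention), then select whole pages by query relevance under a token budget. The sketch below is a minimal illustration using cosine-similarity k-means; all function names, the page size, and the selection heuristic are assumptions for illustration, not IceCache's actual implementation.

```python
import numpy as np

def cluster_keys_into_pages(keys, page_size=16, n_clusters=8, iters=10, seed=0):
    """Hypothetical sketch: k-means (cosine similarity) over cached key
    vectors, then pack each cluster's tokens into contiguous pages so
    semantically related tokens share memory regions."""
    rng = np.random.default_rng(seed)
    norm = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    centroids = norm[rng.choice(len(keys), n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmax(norm @ centroids.T, axis=1)  # nearest centroid by cosine
        for c in range(n_clusters):
            members = norm[assign == c]
            if len(members):
                mean = members.mean(axis=0)
                centroids[c] = mean / np.linalg.norm(mean)
    # Pack token indices cluster by cluster into fixed-size pages.
    pages = []
    for c in range(n_clusters):
        idx = np.where(assign == c)[0]
        for start in range(0, len(idx), page_size):
            pages.append((c, idx[start:start + page_size]))
    return centroids, pages

def select_pages(query, centroids, pages, token_budget=256):
    """Score each page by query-centroid cosine similarity and keep the
    highest-scoring pages until the token budget is exhausted."""
    q = query / np.linalg.norm(query)
    scores = centroids @ q  # one relevance score per cluster
    order = sorted(range(len(pages)), key=lambda i: -scores[pages[i][0]])
    kept, used = [], 0
    for i in order:
        idx = pages[i][1]
        if used + len(idx) <= token_budget:
            kept.append(idx)
            used += len(idx)
    return np.concatenate(kept) if kept else np.empty(0, dtype=np.int64)
```

Selecting whole pages (rather than individual tokens) is what makes the budget cheap to enforce: only the chosen contiguous regions need to be resident on the GPU, which matches the paper's reported 256-token budget.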