SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
arXiv cs.LG · April 22, 2026
Key Points
- The paper argues that KV-cache memory is a key bottleneck for real-world LLM serving, especially when systems must handle both low-latency small batches and high-throughput concurrent requests.
- It finds a small set of 4-bit KV-cache quantization techniques that remain practical under deployment constraints like paged memory layouts, regular memory access, and fused attention execution.
- The main recommendation is token-wise INT4 quantization combined with block-diagonal Hadamard rotation, which achieves the best accuracy–efficiency trade-off across multiple models and benchmarks.
- The authors implement a fused rotation-quantization kernel integrated into paged KV-cache layouts, reporting zero measurable end-to-end overhead and matching plain INT4 throughput at different concurrency levels.
- Overall, the work frames KV-cache compression as a systems co-design problem, showing that lightweight Hadamard rotation can deliver near-lossless accuracy without harming serving efficiency.
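The recommended recipe in the key points above — a block-diagonal Hadamard rotation over the channel dimension followed by token-wise symmetric INT4 quantization — can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's kernel: the function names, the block size of 16, and the symmetric zero-point-free scaling are assumptions, and the real system fuses these steps into a paged-KV attention kernel rather than running them eagerly.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix;
    # n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_block_diagonal(x, block=16):
    # Apply the same small Hadamard rotation to each contiguous
    # block of channels. Because the rotation is orthogonal, it
    # preserves inner products while spreading outlier channels
    # across the block, which flattens the per-token dynamic range.
    d = x.shape[-1]
    assert d % block == 0, "head dim must be divisible by block size"
    H = hadamard(block)
    xb = x.reshape(*x.shape[:-1], d // block, block)
    return (xb @ H).reshape(x.shape)

def quantize_int4_tokenwise(x):
    # One symmetric scale per token (row): values map to [-8, 7].
    s = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    s = np.where(s == 0, 1.0, s)  # guard all-zero tokens
    q = np.clip(np.round(x / s), -8, 7).astype(np.int8)
    return q, s

def dequantize(q, s):
    return q.astype(np.float32) * s

# Usage: rotate each K/V token before quantizing its cache entry.
k = np.random.randn(4, 64).astype(np.float32)   # 4 tokens, head dim 64
k_rot = rotate_block_diagonal(k, block=16)
q, s = quantize_int4_tokenwise(k_rot)
k_hat = dequantize(q, s)
```

Note that because the rotation is orthogonal, applying the same block-diagonal Hadamard to the query side leaves the attention scores QKᵀ unchanged up to quantization error, which is why the rotation can be "free" in accuracy terms and, per the paper's fused-kernel claim, cheap in latency terms.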