PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
arXiv cs.LG / April 29, 2026
Key Points
- PolyKV proposes a shared KV-cache pool for multi-agent LLM inference: instead of allocating a separate KV cache per agent, one compressed cache is injected into multiple agent contexts (a minimal pool sketch follows this list).
- The approach uses asymmetric compression: keys are quantized to int8 (q8_0) for softmax stability, while values are compressed via TurboQuant, a fast Walsh-Hadamard transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization (both paths are sketched in code below).
- Experiments on SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct across multiple context lengths and up to 15 concurrent agents show a stable 2.91× compression ratio.
- For Llama-3-8B with 15 agents and a 4K-token context, PolyKV cuts KV-cache memory from 19.8 GB to 0.45 GB (a 97.7% reduction) with only +0.57% perplexity degradation and a strong BERTScore F1 of 0.928 (the arithmetic is unpacked below).
- The paper reports that the perplexity delta does not grow with the number of agents and can even improve at longer coherent contexts; the claimed novelty is the combination of a shared lossy-compressed KV pool with concurrent multi-reader access.
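
On the first point, the shift is from N private caches to a single compressed entry with many readers. A minimal sketch of that bookkeeping, assuming a pool keyed by a context identifier; all names here (SharedKVPool, attach, and so on) are invented for illustration, since the summary does not describe the paper's actual pool or concurrency machinery:

```python
class SharedKVPool:
    """Sketch of a shared compressed KV pool: one entry per context,
    arbitrarily many agent readers, no per-agent fp16 copies."""

    def __init__(self, compress, decompress):
        self._compress = compress    # e.g. the asymmetric scheme sketched below
        self._decompress = decompress
        self._entries = {}           # context_id -> compressed KV blob
        self._readers = {}           # context_id -> set of reader agent ids

    def put(self, context_id, keys, values):
        # Compress once; every agent sharing this context reads the same copy.
        self._entries[context_id] = self._compress(keys, values)
        self._readers[context_id] = set()

    def attach(self, context_id, agent_id):
        # Attaching a new reader is pure bookkeeping: no cache allocation.
        self._readers[context_id].add(agent_id)

    def read(self, context_id):
        # Readers never mutate the shared entry, so concurrent reads are safe.
        return self._decompress(self._entries[context_id])
```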
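The asymmetry in the second bullet maps onto two separate code paths. A numpy sketch, assuming a llama.cpp-style q8_0 layout for keys (32-value blocks, one absmax scale each) and approximating TurboQuant's value path as an orthonormal FWHT followed by a Lloyd-Max codebook fitted with plain 1-D k-means; block size, codebook scope, and function names are assumptions, not the paper's implementation:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.
    The length must be a power of two; the transform is its own inverse."""
    x = x.astype(np.float32).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_keys_q8_0(k, block=32):
    """Keys: blockwise int8 with one absmax scale per 32-value block
    (q8_0-style), preserving enough precision for stable softmax logits."""
    kb = k.reshape(-1, block).astype(np.float32)  # size must divide into blocks
    scale = np.abs(kb).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(kb / scale), -127, 127).astype(np.int8)
    return q, scale

def fit_lloyd_max(samples, bits=3, iters=25):
    """Fit a 2**bits-level MMSE scalar quantizer (Lloyd-Max) via 1-D k-means."""
    cb = np.quantile(samples, np.linspace(0.0, 1.0, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - cb[None, :]).argmin(axis=1)
        for j in range(cb.size):
            bucket = samples[idx == j]
            if bucket.size:
                cb[j] = bucket.mean()
    return np.sort(cb)

def compress_values(v, bits=3):
    """Values: FWHT rotation, then 3-bit Lloyd-Max codes plus the codebook."""
    r = fwht(v)
    flat = r.ravel()
    sample = flat[:: max(1, flat.size // 65536)]  # subsample for codebook fitting
    cb = fit_lloyd_max(sample, bits=bits)
    codes = np.abs(r[..., None] - cb).argmin(axis=-1).astype(np.uint8)
    return codes, cb  # 3-bit codes, stored in uint8 here for simplicity

def decompress_values(codes, cb):
    # The orthonormal FWHT is involutive; applying it again undoes the rotation.
    return fwht(cb[codes])
```

The rotation step follows the usual motivation for rotation-based quantizers: spreading outliers across dimensions makes the coordinate distribution friendlier to a single low-bit codebook, while the involutive orthonormal FWHT keeps decompression down to a table lookup plus one more transform.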
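The memory numbers in the fourth bullet decompose into the two mechanisms: a 15× saving from keeping one cache instead of fifteen, multiplied by the 2.91× compression ratio. A quick consistency check using only the figures quoted above:

```python
# Decomposition of the reported savings, using only numbers from the summary.
n_agents = 15
baseline_gb = 19.8                       # 15 private fp16 caches (reported)
per_agent_gb = baseline_gb / n_agents    # ~1.32 GB per private cache
shared_gb = per_agent_gb / 2.91          # ~0.45 GB: one pooled, compressed copy
reduction = 1 - shared_gb / baseline_gb  # ~0.977, i.e. the reported 97.7%
print(f"{shared_gb:.2f} GB shared, {reduction:.1%} reduction")
```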