OjaKV: Context-Aware Online Low-Rank KV Cache Compression
arXiv cs.CL / 4/20/2026
Key Points
- Long-context LLM generation is limited by the memory footprint of the key-value (KV) cache, which can exceed the size of the model weights for long prompts at common batch sizes.
- Existing low-rank KV-cache compression methods often assume a static, offline-learned subspace and degrade when the input data distribution shifts.
- OjaKV proposes a hybrid strategy that keeps the first and most recent tokens at full rank while applying low-rank compression to the intermediate tokens.
- OjaKV further improves robustness by adapting the low-rank projection basis online with Oja's algorithm (strong updates during prompt prefill, lightweight updates during decoding), keeping the subspace aligned with the changing context.
- Experiments show OjaKV is compatible with FlashAttention and can maintain or improve zero-shot accuracy, especially on very long-context reasoning benchmarks, without requiring model fine-tuning.
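The online adaptation above rests on Oja's subspace rule, a streaming update that tracks the top-k principal subspace of incoming vectors without ever forming a covariance matrix. Below is a minimal NumPy sketch of that rule; the function name, dimensions, and learning rate are illustrative assumptions, not details from the paper, and the demo stream stands in for per-token key vectors.

```python
import numpy as np

def oja_subspace_update(W, x, lr):
    """One Oja subspace-rule step: W <- W + lr * (x y^T - W y y^T), y = W^T x.

    Pulls the columns of W (d x k) toward the top-k principal subspace of the
    input stream while keeping them approximately orthonormal.
    """
    y = W.T @ x                                       # project vector onto current basis
    W += lr * (np.outer(x, y) - W @ np.outer(y, y))   # Hebbian term minus decay term
    return W

# Demo: stream vectors whose variance is concentrated in the first two axes,
# mimicking keys whose energy lives in a low-dimensional subspace.
rng = np.random.default_rng(0)
d, k = 8, 2
scales = np.array([2.0, 1.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
W = np.linalg.qr(rng.standard_normal((d, k)))[0]      # random orthonormal init
for _ in range(5000):
    x = scales * rng.standard_normal(d)
    W = oja_subspace_update(W, x, lr=0.01)
```

In an OjaKV-style setting, the prefill phase would apply many such updates over the prompt's key vectors, and decoding would apply occasional lightweight ones, so the projection basis drifts with the context instead of staying frozen at an offline-learned subspace.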

