ICaRus: Identical Cache Reuse for Efficient Multi-Model Inference
arXiv cs.LG · March 17, 2026
Key Points
- ICaRus proposes Identical Cache Reuse to allow multiple models to share an identical KV cache across all layers, dramatically reducing memory usage in multi-model inference.
- The method conceptualizes a decoder-only Transformer as a logical encoder that generates KV caches and a logical decoder that produces tokens from those caches, enabling the encoder to be frozen while training only the decoder.
- By freezing the encoder and using lightweight adapters like LoRA, ICaRus enables cross-model cache sharing and parallel KV cache generation with next-token prediction to cut recomputation.
- In experiments with eight models, ICaRus achieves up to 11.1x lower P95 latency and 3.8x higher throughput while maintaining comparable accuracy to task-specific fine-tuned baselines.
- The approach avoids the cache-memory explosion and evictions that plague multi-model serving, offering scalable efficiency gains for agentic AI workflows.
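The encoder/decoder split described above can be illustrated with a toy sketch. This is not the paper's code: it assumes a simplified single-head attention layer, and all weight names (`W_k`, `W_v`, `W_q`, `lora_delta`) are hypothetical. The point it demonstrates is that if the K/V projections (the "logical encoder") are frozen and shared, every task-specific model reads the same KV cache, and only lightweight low-rank adapters on the query side (the "logical decoder") differ per task.

```python
# Toy sketch of identical cache reuse (assumption: single-head attention,
# simplified shapes; not ICaRus's actual implementation).
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size

# Frozen "logical encoder" weights: shared by ALL models,
# so their KV caches are byte-identical and stored only once.
W_k = rng.standard_normal((d, d)) * 0.1
W_v = rng.standard_normal((d, d)) * 0.1

def build_kv_cache(prompt):            # prompt: (T, d) token embeddings
    return prompt @ W_k, prompt @ W_v  # K, V computed once, reused everywhere

# "Logical decoder" side: a shared base query projection
# plus per-task LoRA-style low-rank deltas.
W_q = rng.standard_normal((d, d)) * 0.1

def lora_delta(rank=2):
    A = rng.standard_normal((d, rank)) * 0.1
    B = rng.standard_normal((rank, d)) * 0.1
    return A @ B                        # task-specific adapter, rank << d

def decode_step(x, K, V, delta):        # x: (d,), one query token
    q = x @ (W_q + delta)               # only the decoder differs per task
    attn = np.exp(q @ K.T)
    attn /= attn.sum()                  # softmax over cached positions
    return attn @ V                     # read from the SHARED cache

prompt = rng.standard_normal((5, d))
K, V = build_kv_cache(prompt)           # one cache for every model

task_a, task_b = lora_delta(), lora_delta()
x = rng.standard_normal(d)
out_a = decode_step(x, K, V, task_a)    # two "models", zero extra cache memory
out_b = decode_step(x, K, V, task_b)
print(out_a.shape, out_b.shape)
```

In a real multi-model deployment the payoff is that N task-specialized models need one KV cache instead of N, which is where the reported latency and throughput gains come from.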