ICaRus: Identical Cache Reuse for Efficient Multi-Model Inference
arXiv cs.LG / 3/17/2026
Key Points
- ICaRus proposes Identical Cache Reuse to allow multiple models to share an identical KV cache across all layers, dramatically reducing memory usage in multi-model inference.
- The method conceptualizes a decoder-only Transformer as a logical encoder that generates KV caches and a logical decoder that produces tokens from those caches, enabling the encoder to be frozen while training only the decoder.
- By freezing the encoder and using lightweight adapters such as LoRA, ICaRus enables cross-model cache sharing and lets KV cache generation run in parallel with next-token prediction, cutting recomputation.
- In experiments with eight models, ICaRus achieves up to 11.1x lower P95 latency and 3.8x higher throughput while maintaining comparable accuracy to task-specific fine-tuned baselines.
- The approach eliminates cache memory explosion and evictions in multi-model systems, offering scalable efficiency gains for agentic AI workflows.
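To make the encoder/decoder split concrete, here is a minimal numpy sketch of the idea described above: a frozen set of key/value projections (the "logical encoder") prefills one KV cache, and several task-specific "logical decoders" that differ only by low-rank LoRA-style adapters attend over that same cache without recomputing it. All names, dimensions, and weights are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy model dimension

# "Logical encoder": frozen projections shared by every model.
# Because they are frozen, the KV cache they produce is identical across models.
W_k = rng.standard_normal((D, D)) / np.sqrt(D)
W_v = rng.standard_normal((D, D)) / np.sqrt(D)
W_q = rng.standard_normal((D, D)) / np.sqrt(D)  # frozen base query weight

def build_kv_cache(prompt_embs):
    """One prefill pass over the prompt; the cache can then be reused verbatim."""
    return prompt_embs @ W_k, prompt_embs @ W_v  # each (T, D)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

class LoRADecoder:
    """A 'logical decoder' variant: frozen base weights plus a per-task
    low-rank adapter (W_q + A @ B), the only part that is trained."""
    def __init__(self, rank=2):
        self.A = rng.standard_normal((D, rank)) * 0.1
        self.B = rng.standard_normal((rank, D)) * 0.1

    def step(self, x, cache):
        K, V = cache  # reused as-is: no per-model recomputation of the prompt
        q = x @ (W_q + self.A @ self.B)
        attn = softmax(q @ K.T / np.sqrt(D))
        return attn @ V  # toy stand-in for the next-token hidden state

prompt = rng.standard_normal((5, D))  # 5 prompt-token embeddings
cache = build_kv_cache(prompt)        # prefill exactly once

# Two task-specialized decoders share the single cache.
dec_a, dec_b = LoRADecoder(), LoRADecoder()
x = rng.standard_normal(D)
out_a = dec_a.step(x, cache)
out_b = dec_b.step(x, cache)
```

With N models served this way, prompt memory and prefill compute stay constant in N instead of growing linearly, which is the source of the latency and throughput gains the paper reports.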