Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
arXiv cs.CL / 4/7/2026
Key Points
- The paper introduces “Knowledge Packs,” which use pre-computed KV cache injections to deliver RAG knowledge at zero additional token cost, aiming to eliminate the token waste inherent in RAG workflows.
- It argues an exact KV-cache equivalence for causal transformers: the cache from a forward pass on a prefix F alone is identical to the prefix portion of the cache from a joint pass on F followed by a query q, though this equivalence is fragile to chat-template formatting errors.
- With correct formatting, experiments report zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, achieving up to 95% token savings versus typical RAG approaches.
- The work also claims the KV interface can enable “behavioral steering” that RAG can’t replicate, by applying contrastive deltas to cached values (while noting that key arithmetic breaks coherence due to RoPE behavior).
- The authors report that steering can be applied concurrently with cached knowledge (with steering strength alpha ≤ 0.7) without interference, and that the steering effect is concentrated in mid-depth value states (roughly 33–66% of layer depth).
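The prefix-cache equivalence behind these claims can be illustrated with a toy single-head causal attention layer. This is a minimal numpy sketch, not the paper's code: the random weight matrices and "token" vectors are placeholders, and it omits RoPE and chat templating (which is exactly where the paper says the fragility lives).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size
# Random projections standing in for one attention head's weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attn(x, k_cache=None, v_cache=None):
    """Single-head causal attention; optionally prepend cached K/V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    if k_cache is not None:
        k = np.vstack([k_cache, k])
        v = np.vstack([v_cache, v])
    n_new, n_total = x.shape[0], k.shape[0]
    scores = q @ k.T / np.sqrt(d)
    # New token i sits at global position n_total - n_new + i and may only
    # attend to positions <= its own; mask out everything later.
    mask = np.triu(np.ones((n_new, n_total)), k=n_total - n_new + 1).astype(bool)
    scores[mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, k, v

F = rng.standard_normal((5, d))      # stand-in for "knowledge pack" prefix tokens
q_tok = rng.standard_normal((3, d))  # stand-in for user question tokens

# Joint pass over F followed by q.
out_joint, _, _ = causal_attn(np.vstack([F, q_tok]))

# Cached pass: forward F alone, stash its K/V, then run q against the cache.
_, kF, vF = causal_attn(F)
out_cached, _, _ = causal_attn(q_tok, kF, vF)

print(np.allclose(out_joint[-3:], out_cached))  # True: the caches are interchangeable
```

The same identity is what lets a pre-computed "Knowledge Pack" cache be injected in place of re-tokenizing and re-encoding the retrieved text; in a real transformer stack the cached K/V additionally bake in positional encoding and chat-template tokens, so the prefix must match byte-for-byte for the equivalence to hold.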