Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

arXiv cs.CL / 4/7/2026


Key Points

  • The paper introduces “Knowledge Packs,” which use pre-computed KV cache injections to deliver RAG knowledge at zero additional token cost, aiming to eliminate the token waste inherent in RAG workflows.
  • It argues an exact KV-cache equivalence for causal transformers: the KV cache from a forward pass on text F is identical to the F-prefix portion of the cache produced by a joint pass on F+q, though this equivalence is fragile to chat-template formatting errors.
  • With correct formatting, experiments report zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, achieving up to 95% token savings versus typical RAG approaches.
  • The work also claims the KV interface can enable “behavioral steering” that RAG can’t replicate, by applying contrastive deltas to cached values (while noting that key arithmetic breaks coherence due to RoPE behavior).
  • The authors report steering can be applied concurrently with cached knowledge (using alpha≤0.7) without interference, and that the steering effect primarily resides in mid-layer value states (roughly 33–66% of network depth).
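
The core equivalence in the first two points follows directly from the causal attention mask: every key, value, and attention output at a prefix position depends only on earlier positions, so appending a question cannot change them. A minimal numpy sketch (not the paper's code; the single-head attention, dimensions, and weights here are illustrative assumptions) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # hypothetical head dimension
n_prefix, n_query = 5, 3   # lengths of knowledge text F and question q

# Random stand-ins for token embeddings of F and q
F = rng.normal(size=(n_prefix, d))
q = rng.normal(size=(n_query, d))

# Random projection weights for a single attention head
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(x):
    """Single-head causal self-attention; returns (output, K, V)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    n = len(x)
    # Causal mask: position i may not attend to positions j > i
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, K, V

# Forward pass on F alone vs. a joint pass on F+q
out_prefix, K_prefix, V_prefix = causal_attention(F)
out_joint, K_joint, V_joint = causal_attention(np.vstack([F, q]))

# The prefix rows of the joint pass match the prefix-only pass exactly,
# so a pre-computed KV cache for F can be injected at zero token cost.
assert np.allclose(K_prefix, K_joint[:n_prefix])
assert np.allclose(V_prefix, V_joint[:n_prefix])
assert np.allclose(out_prefix, out_joint[:n_prefix])
```

This is also why formatting matters: if the cached F was encoded under a different chat template than the live prompt, the prefix tokens themselves differ and the equivalence no longer applies.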

Abstract

RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to the F-prefix of the cache a joint pass on F+q would produce; this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat-template formatting causes 6-7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid-layer values (33-66% of depth), independent directions are nearly orthogonal (cos~0) and compose, and both channels, knowledge and steering, run simultaneously at alpha<=0.7 without interference. No training, no weight modification.
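
The steering mechanism described in the abstract can be sketched in a few lines. This is an illustrative numpy mock-up under stated assumptions, not the authors' implementation: the cached states, the contrastive prompts, and the scaling factor are placeholders, and real caches would be per-layer, per-head tensors with steering restricted to the mid-layer range:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d = 5, 8
alpha = 0.7  # the paper reports interference-free steering at alpha <= 0.7

# Hypothetical cached value states from two contrastive prompts
# (e.g. a "desired behavior" prompt vs. a neutral baseline)
V_pos = rng.normal(size=(n_tokens, d))
V_neg = rng.normal(size=(n_tokens, d))

# Hypothetical KV cache of a pre-computed Knowledge Pack
V_pack = rng.normal(size=(n_tokens, d))
K_pack = rng.normal(size=(n_tokens, d))

# Contrastive delta is applied to VALUES only. Keys are left untouched:
# RoPE position-rotates keys, so naive key arithmetic mixes incompatible
# rotations and destroys coherence.
delta = V_pos - V_neg
V_steered = V_pack + alpha * delta  # keys: K_pack, unchanged

# Near-orthogonal deltas (cos ~ 0) from independent behaviors would
# simply be summed here, which is why the paper says they compose.
```

Both channels coexist because the knowledge lives in the injected cache entries while the steering is a small additive perturbation of their value vectors; nothing about the forward pass or weights changes.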
