Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

arXiv cs.CL / 4/7/2026


Key Points

  • The paper introduces “Knowledge Packs,” which use pre-computed KV cache injections to deliver RAG knowledge at zero additional token cost, aiming to eliminate the token waste inherent in RAG workflows.
  • It argues an exact KV-cache equivalence for causal transformers: the KV cache from a forward pass on text F is identical to the F-prefix portion of the cache produced by a joint pass on F+q, though this equivalence is fragile to chat-template formatting errors.
  • With correct formatting, experiments report zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, achieving up to 95% token savings versus typical RAG approaches.
  • The work also claims the KV interface can enable “behavioral steering” that RAG can’t replicate, by applying contrastive deltas to cached values (while noting that key arithmetic breaks coherence due to RoPE behavior).
  • The authors report steering can be applied concurrently with cached knowledge (using alpha≤0.7) without interference, and that the steering effect primarily resides in mid-layer value states (roughly 33–66% of network depth).
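
The core equivalence in the first two points follows directly from the causal attention mask: every key, value, and attention output at a prefix position depends only on earlier positions, so appending a question cannot change them. A minimal numpy sketch (not the paper's code; the single-head attention, dimensions, and weights here are illustrative assumptions) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # hypothetical head dimension
n_prefix, n_query = 5, 3   # lengths of knowledge text F and question q

# Random stand-ins for token embeddings of F and q
F = rng.normal(size=(n_prefix, d))
q = rng.normal(size=(n_query, d))

# Random projection weights for a single attention head
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(x):
    """Single-head causal self-attention; returns (output, K, V)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    n = len(x)
    # Causal mask: position i may not attend to positions j > i
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, K, V

# Forward pass on F alone vs. a joint pass on F+q
out_prefix, K_prefix, V_prefix = causal_attention(F)
out_joint, K_joint, V_joint = causal_attention(np.vstack([F, q]))

# The prefix rows of the joint pass match the prefix-only pass exactly,
# so a pre-computed KV cache for F can be injected at zero token cost.
assert np.allclose(K_prefix, K_joint[:n_prefix])
assert np.allclose(V_prefix, V_joint[:n_prefix])
assert np.allclose(out_prefix, out_joint[:n_prefix])
```

This is also why formatting matters: if the cached F was encoded under a different chat template than the live prompt, the prefix tokens themselves differ and the equivalence no longer applies.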

Abstract

RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to the F-prefix of the cache a joint pass on F+q would produce; this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat-template formatting causes 6-7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid-layer values (33-66% of depth), independent directions are nearly orthogonal (cos~0) and compose, and both channels, knowledge and steering, run simultaneously at alpha<=0.7 without interference. No training, no weight modification.
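
The steering mechanism described in the abstract can be sketched in a few lines. This is an illustrative numpy mock-up under stated assumptions, not the authors' implementation: the cached states, the contrastive prompts, and the scaling factor are placeholders, and real caches would be per-layer, per-head tensors with steering restricted to the mid-layer range:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d = 5, 8
alpha = 0.7  # the paper reports interference-free steering at alpha <= 0.7

# Hypothetical cached value states from two contrastive prompts
# (e.g. a "desired behavior" prompt vs. a neutral baseline)
V_pos = rng.normal(size=(n_tokens, d))
V_neg = rng.normal(size=(n_tokens, d))

# Hypothetical KV cache of a pre-computed Knowledge Pack
V_pack = rng.normal(size=(n_tokens, d))
K_pack = rng.normal(size=(n_tokens, d))

# Contrastive delta is applied to VALUES only. Keys are left untouched:
# RoPE position-rotates keys, so naive key arithmetic mixes incompatible
# rotations and destroys coherence.
delta = V_pos - V_neg
V_steered = V_pack + alpha * delta  # keys: K_pack, unchanged

# Near-orthogonal deltas (cos ~ 0) from independent behaviors would
# simply be summed here, which is why the paper says they compose.
```

Both channels coexist because the knowledge lives in the injected cache entries while the steering is a small additive perturbation of their value vectors; nothing about the forward pass or weights changes.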
