KIV: 1M token context window on an RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - works with any model that uses DynamicCache [P]

Reddit r/MachineLearning / 4/13/2026


Key Points

  • KIV (K-Indexed V Materialization) is a middleware layer that replaces HuggingFace transformers’ standard KV cache with a tiered memory approach, keeping recent KV entries in VRAM and moving older ones to system RAM with retrieval at decode time.
  • The method uses K vectors as a structured, searchable index to select roughly the 256 most relevant V entries per decode step, aiming to make decode speed nearly independent of total context length.
  • Benchmarks on a single RTX 4070 (12GB VRAM) with Gemma 4 (4-bit) reportedly support up to 1M tokens using ~12MB KV overhead and ~6.5GB total GPU usage, with needle-in-haystack and phonebook-style retrieval tests performing strongly.
  • The integration is described as a “drop-in” HuggingFace cache replacement that does not modify model weights, does not require retraining/distillation, and should work with any model using DynamicCache (tested across Gemma 4, Qwen2.5, TinyLlama, Phi-3.5; MQA/GQA/MHA).
  • Limitations include some information loss on dense, similar-looking data under bounded prefill, failures on collision disambiguation and two-hop reasoning (attributed largely to the struggling 4-bit 2B model rather than the cache), and a CPU-to-GPU transfer bottleneck that currently constrains decode speed.

Been working on this for a bit and figured it was ready to share. KIV (K-Indexed V Materialization) is a middleware layer that replaces the standard KV cache in HuggingFace transformers with a tiered retrieval system. The short version: it keeps recent tokens exact in VRAM, moves old K/V to system RAM, and uses K vectors as a search index to pull back only the ~256 most relevant V entries per decode step.

Results on a 4070 12GB with Gemma 4 E2B (4-bit):

  • 1M tokens, 12MB KIV VRAM overhead, ~6.5GB total GPU usage
  • 4.1 tok/s at 1M context (8-10 tok/s on GPU time), 12.9 tok/s at 4K
  • 70/70 needle-in-haystack tests passed across 4K-32K
  • Perfect phonebook lookup (unique names) at 58K tokens
  • Prefill at 1M takes about 4.3 minutes (one-time cost)
  • Decode is near-constant regardless of context length
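To see why offloading is necessary at all, here is a back-of-envelope estimate of what a full fp16 KV cache would cost at 1M tokens. The layer/head/dim values below are illustrative assumptions for a small GQA model, not Gemma's actual shapes:

```python
# Rough full-precision KV cache size at long context.
# layers/kv_heads/head_dim are hypothetical, chosen only to show the scale.
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store (tokens x kv_heads x head_dim) per layer.
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(tokens=1_000_000, layers=26, kv_heads=4, head_dim=128)
print(f"{full / 2**30:.1f} GiB")  # tens of GiB, far beyond a 12GB card
```

Even under these modest assumed shapes, a resident 1M-token cache lands around ~50 GiB, which is why tiering it out to system RAM (and materializing only a small working set) is the enabling move.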

The core finding that makes this work: K vectors are smooth and structured, which makes them great search indices. V vectors are high-entropy and chaotic, so don't try to compress them; just retrieve them on demand. Use K to decide which V entries deserve to exist in VRAM at any given step.
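The selection step above can be sketched in a few lines. This is a plain-Python illustration of the idea (score every offloaded K against the current query, materialize only the top-k V entries), not KIV's actual GPU implementation; the function name and list-based stores are mine:

```python
def topk_retrieve(query, k_index, v_store, k=256):
    """Use the K vectors as a search index: score each offloaded key
    against the current query, then materialize only the top-k V
    entries. Sketched with plain lists instead of GPU tensors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(query, key) for key in k_index]
    # indices of the k highest-scoring keys
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # return (index, V) pairs in original token order
    return [(i, v_store[i]) for i in sorted(top)]

# Toy usage: the query aligns with keys 0 and 2, so only their Vs come back.
hits = topk_retrieve([1.0, 0.0],
                     [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
                     ["v0", "v1", "v2"], k=2)
```

The point is that the expensive object (V) never has to live in VRAM; only the cheap, structured index (K) has to be searchable every step.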

No model weights are modified. No retraining or distillation. It hooks into the HuggingFace cache interface and registers a custom attention function. The model has no idea it's talking to a tiered memory system. Works with any model that uses DynamicCache. Tested on Gemma 4, Qwen2.5, TinyLlama, and Phi-3.5 across MQA/GQA/MHA.
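For readers unfamiliar with the hook point: HuggingFace models call `cache.update(...)` each step, so a replacement cache can spill entries without the model noticing. The class below is my own minimal stand-in showing the interface shape with plain lists; real KIV operates on GPU tensors and implements the actual `DynamicCache` API, which this does not:

```python
class TieredCacheSketch:
    """Illustrative stand-in for a tiered KV cache: recent entries stay
    'hot' (VRAM in the real system), older ones are evicted to a 'cold'
    store (system RAM). Interface and internals are simplified."""

    def __init__(self, window=4):
        self.window = window          # how many recent entries stay hot
        self.hot = []                 # exact, recent (key, value) pairs
        self.cold = []                # offloaded (key, value) pairs

    def update(self, key, value):
        # Append the new KV pair, then spill anything beyond the
        # recent window to the cold tier. The model only ever sees
        # the same update-then-read contract it expects.
        self.hot.append((key, value))
        while len(self.hot) > self.window:
            self.cold.append(self.hot.pop(0))
        return self.hot, self.cold
```

Because the contract is just "store what I give you, hand back what attention needs," the model genuinely cannot tell a tiered store from a flat one.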

There are real limitations and I'm upfront about them in the repo. Bounded prefill loses some info for dense, similar-looking data. Collision disambiguation doesn't work, but that's the 4-bit 2B model struggling, not the cache. Two-hop reasoning fails for the same reason. CPU RAM scales linearly (5.8GB at 1M tokens).

Still actively optimizing decode speed, especially at longer contexts. The current bottleneck is CPU-to-GPU transfer for retrieved tokens, not the model itself. Plenty of room to improve here.
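A rough estimate makes the bottleneck plausible. If ~256 V entries are fetched per layer per step, the per-step host-to-device traffic looks like this (layer/head/dim values are assumptions, not measured from KIV):

```python
# Hypothetical per-step transfer for retrieving 256 V entries per layer.
layers, k, kv_heads, head_dim, fp16_bytes = 26, 256, 4, 128, 2
bytes_per_step = layers * k * kv_heads * head_dim * fp16_bytes
print(f"{bytes_per_step / 2**20:.1f} MiB per decode step")
```

A few MiB per step would be trivial as one contiguous copy, but gathered, per-layer scattered copies over PCIe run far below peak bandwidth, which is consistent with transfer (not compute) capping tok/s at long context.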

GitHub: github.com/Babyhamsta/KIV (can be installed as a local pip package, no official pip package yet)

Happy to answer questions about the architecture or results. Would love to see what happens on bigger models with more VRAM if anyone wants to try it.

submitted by /u/ThyGreatOof