The exact KV cache usage of DeepSeek V4

Reddit r/LocalLLaMA / 4/26/2026


Key Points

  • The post analyzes the DeepSeek V4 (DSV4) paper’s implications for how much KV cache is required at extremely long contexts (up to 1M tokens), arguing that the paper’s Figure 1 understates the actual KV-cache sizes for both DSV3.2 and DSV4.
  • Using the author’s own FP16 KV-cache calculations, DSV3.2 is estimated at ~68.6GiB for 1M context (not ~50GiB), while DSV4 Flash and DSV4 Pro are estimated at about ~6.08GiB and ~8.71GiB respectively.
  • Compared with DSV3.2, the KV-cache reduction is estimated at ~7.88× (not 9.5×), though the KV% metric still indicates close to ~20× improvement.
  • The author suggests that once DSV4 is supported in llama.cpp, long-context runs could become practical on consumer-to-prosumer hardware (e.g., 256GB RAM with a 3090 for Flash, and much larger RAM/GPUs for Pro), and notes that derivatives from other Chinese AI teams may further reduce KV cache usage.

Figure 1 of the DSV4 paper seems to imply that DSV3.2 uses ~50GB at 1m context and DSV4 uses ~5GB:

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

From my own calculations, the correct FP16 KV cache at 1m context should be:

Model     Params   128k      160k       1m         KV%
V3.x      671B     8.58GiB   10.72GiB   68.63GiB   5.11%
V4 Flash  284B     0.76GiB   0.95GiB    6.08GiB    1.07%
V4 Pro    1600B    1.09GiB   1.36GiB    8.71GiB    0.272%

So the KV cache saving is 7.879x rather than 9.5x, but it is still very impressive. If you look at the KV% metric, we are seeing close to a 20x gain. This basically obliterates the KV cache usage of all current transformer-SSM hybrid models, though the transformer-SSM crowd can just adopt DSV4's CSA and HCA on their transformer layers to catch up.

At this level of KV cache usage, once DSV4 is supported in llama.cpp we can easily run 1m context for DSV4 Flash on 256GB RAM and a 3090, or for DSV4 Pro on 1.5TB RAM and an RTX 6000 Blackwell. I suppose the various speed gains mentioned in the paper can make this viable.
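As a rough back-of-the-envelope check on that claim (a sketch only; I'm assuming a ~4.5 bits-per-weight llama.cpp quant, and the exact figure depends on which quant you pick):

```python
# Rough RAM check for 1m-context runs. The 4.5 bpw figure is an assumption
# about a typical llama.cpp quant, not something from the paper.
def quant_weights_gib(params, bits_per_weight=4.5):
    return params * bits_per_weight / 8 / 2**30

flash = quant_weights_gib(284e9) + 6.08    # quantized weights + FP16 KV cache at 1m context
pro   = quant_weights_gib(1600e9) + 8.71
print(f"Flash: ~{flash:.0f} GiB")   # ~155 GiB, fits in 256GB RAM
print(f"Pro:   ~{pro:.0f} GiB")     # ~847 GiB, fits in 1.5TB RAM
```

Flash comes out around 155GiB and Pro around 847GiB, so both fit with room to spare for activations and the OS.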

While DSV4 Pro doesn't do well on Artificial Analysis, we can expect Kimi and Zhipu to make derivatives of it, so that we end up with a beast that uses very little KV cache.

All in all, DS is still doing very well as the research backbone of the Chinese AI scene.

PS: More detailed calculations for those interested. Please let me know if I got any of the math wrong:

Based on what I see from actually running V3.2 with llama.cpp, the FP16 KV cache usage for DSV3.2 is 10.72GiB at 160k context, which extrapolates to 68.625GiB at a hypothetical 1m context.

This number can be validated with the per-token, per-layer MLA KV cache formula: (kv_lora_rank + qk_rope_head_dim) * precision = (512 + 64) * 2 = 1152 bytes. So for 61 layers and 1m tokens, it comes to 1152*61*1024*1024 bytes = 68.625GiB, which is not 50GB.
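Here is the same arithmetic as a small Python sketch, using only the values stated above:

```python
# Per-token, per-layer MLA KV cache for DSV3.2, then the total at 1m context.
kv_lora_rank = 512
qk_rope_head_dim = 64
precision = 2          # FP16 bytes
n_layers = 61
ctx = 1024 * 1024      # 1m tokens

per_token_per_layer = (kv_lora_rank + qk_rope_head_dim) * precision  # 1152 bytes
v32_kv_gib = per_token_per_layer * n_layers * ctx / 2**30
print(per_token_per_layer, v32_kv_gib)  # 1152, 68.625
```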

On the other hand, DSV4 Pro has 30 CSA layers and 31 HCA layers interleaved. My understanding is that CSA only stores 1/4 of the MLA KV cache, so per token per layer it is 288 bytes, and HCA only stores 1/128 of the MLA KV cache, so per token per layer it is 9 bytes. Therefore, the KV cache at 1m context is (288*30+9*31)*1024*1024 bytes =~ 8.70996GiB, so the KV cache saving is 7.879x, not 9.5x.
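A quick sketch of the Pro numbers, under the same 1/4 and 1/128 assumptions:

```python
# DSV4 Pro: 30 CSA layers (1/4 of MLA cache) + 31 HCA layers (1/128 of MLA cache).
# The 1/4 and 1/128 fractions are my reading of the paper, not confirmed figures.
mla = 1152                 # bytes per token per layer, from the V3.2 formula above
ctx = 1024 * 1024
pro_bytes = (mla // 4 * 30 + mla // 128 * 31) * ctx
v32_bytes = mla * 61 * ctx
print(pro_bytes / 2**30, v32_bytes / pro_bytes)  # ~8.71 GiB, ~7.879x
```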

For DSV4 Flash, the first two layers are Sliding Window Attention with a window size of 128 tokens. Normally, for these two layers, the per-layer KV cache for any length longer than 128 should be 2*n_head_kv*head_dim*precision*window = 2*1*128*2*128 = 65536 bytes. The current llama.cpp implementation adds 256 tokens of padding to the window for better batching, so it becomes 2*1*128*2*(128+256) = 196608 bytes.

There are 21 CSA layers and 20 HCA layers in DSV4 Flash, so the KV cache at 1m context is (288*21+9*20)*1024*1024 + 2*196608 bytes = 6.0824GiB. This is an 11.3x saving compared to DSV3.2, not 13.7x as claimed.
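And the Flash total, including the two SWA layers with llama.cpp's window padding:

```python
# DSV4 Flash: 2 SWA layers (window 128 + 256 padding in llama.cpp) + 21 CSA + 20 HCA.
mla = 1152                 # bytes per token per layer of full MLA cache
ctx = 1024 * 1024
swa_layer = 2 * 1 * 128 * 2 * (128 + 256)            # 196608 bytes per SWA layer
flash_bytes = (mla // 4 * 21 + mla // 128 * 20) * ctx + 2 * swa_layer
v32_bytes = mla * 61 * ctx
print(flash_bytes / 2**30, v32_bytes / flash_bytes)  # ~6.082 GiB, ~11.3x
```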

submitted by /u/Ok_Warning2146