Gemma 4 is a KV Cache Pig

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The post discusses Gemma 4’s unusually large KV cache footprint in dense-model attention, claiming it can be 3x or more than other models.
  • It attributes much of the memory usage to design choices such as using a 256 head dimension rather than 128.
  • The author estimates a KV cache size of about 490KB per token at 8-bit precision (vs ~128KB for Qwen3) and hits a practical limit of ~115k tokens on an RTX Pro 6000 with 96GB of VRAM, using 4-bit weights and an 8-bit KV cache.
  • Despite the high KV-cache cost, the model reportedly scales well with vLLM and still delivers strong intelligence for local inference.

Setting aside the 8-bit actual size of Nvidia’s marketed 4-bit quantization of the dense model…

The dense model’s KV cache uses 3x or more memory than any other model I have seen. The big design choice seems to be a 256 head dimension instead of 128.

I am looking at about 490KB of KV cache per token at 8-bit precision, versus roughly 128KB on Qwen3.
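For context, the per-token KV cache size follows directly from the attention shape: keys plus values, across every layer and KV head. A minimal sketch of the arithmetic, assuming a Qwen3-32B-style configuration (64 layers, 8 grouped-query KV heads, head dim 128 — my assumption, not stated in the post; Gemma 4’s exact layer and KV-head counts aren’t given either):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """Bytes of KV cache per token: 2 tensors (K and V) per layer,
    each kv_heads * head_dim elements."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed Qwen3-32B-style config at 8-bit cache precision:
print(kv_bytes_per_token(64, 8, 128, 1) // 1024)  # 128 (KB), matching the figure above
```

Doubling the head dim from 128 to 256 alone doubles this number; combined with more layers or more KV heads per layer, it can plausibly reach the ~490KB reported here.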

I am running the Nvidia weights at 4-bit on an RTX Pro 6000 with 96GB of VRAM and an 8-bit KV cache, and still only have room for about 115k tokens.
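That ~115k-token ceiling falls out of dividing the VRAM left over after weights by the per-token cache cost. A rough sketch using the post’s 490KB/token figure; the ~40GB set aside for the 4-bit weights plus runtime overhead is a hypothetical value back-solved to land near the reported limit, not a measured number:

```python
GiB = 1024**3
per_token = 490 * 1024                # ~490KB per token at 8-bit (from the post)
vram = 96 * GiB                       # RTX Pro 6000
weights_and_overhead = 40 * GiB       # hypothetical: 4-bit weights + activations
max_tokens = (vram - weights_and_overhead) // per_token
print(max_tokens)  # roughly 120k, in the ballpark of the reported ~115k
```

By the same math, a model at Qwen3’s ~128KB/token would fit well over 400k tokens in the same leftover VRAM, which is the gap the post is complaining about.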

I was just surprised, is all. The model scales well in vLLM and seems quite smart.

submitted by /u/IngeniousIdiocy