Gemma 4 is a KV Cache Pig

Reddit r/LocalLLaMA / 4/4/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The post discusses Gemma 4’s unusually large KV cache footprint in dense-model attention, claiming it can be 3x or more than other models.
  • It attributes much of the memory usage to design choices such as using a 256 head dimension rather than 128.
  • The author estimates a KV cache size of about 490KB per token at 8-bit precision (vs ~128KB for Qwen3) and hits a practical limit of ~115k tokens on an RTX Pro 6000 with 96GB of VRAM, using 4-bit weights and an 8-bit KV cache.
  • Despite the high KV-cache cost, the model reportedly scales well with vLLM and still delivers strong intelligence for local inference.

Setting aside the 8-bit actual size of Nvidia’s marketed 4-bit quantization of the dense model…

The dense model’s KV cache uses 3x or more memory than any other model I have seen. The big design choice seems to be a 256 head dimension instead of 128.

I am looking at about 490KB of KV cache per token at 8-bit precision, versus roughly 128KB on Qwen3.
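For context, the per-token KV cache size follows directly from the attention shape: keys plus values, across every layer and KV head. A minimal sketch of the arithmetic, assuming a Qwen3-32B-style configuration (64 layers, 8 grouped-query KV heads, head dim 128 — my assumption, not stated in the post; Gemma 4’s exact layer and KV-head counts aren’t given either):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """Bytes of KV cache per token: 2 tensors (K and V) per layer,
    each kv_heads * head_dim elements."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed Qwen3-32B-style config at 8-bit cache precision:
print(kv_bytes_per_token(64, 8, 128, 1) // 1024)  # 128 (KB), matching the figure above
```

Doubling the head dim from 128 to 256 alone doubles this number; combined with more layers or more KV heads per layer, it can plausibly reach the ~490KB reported here.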

I am running the Nvidia weights at 4-bit on an RTX Pro 6000 with 96GB of VRAM and an 8-bit KV cache, and still only have room for about 115k tokens.
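That ~115k-token ceiling falls out of dividing the VRAM left over after weights by the per-token cache cost. A rough sketch using the post’s 490KB/token figure; the ~40GB set aside for the 4-bit weights plus runtime overhead is a hypothetical value back-solved to land near the reported limit, not a measured number:

```python
GiB = 1024**3
per_token = 490 * 1024                # ~490KB per token at 8-bit (from the post)
vram = 96 * GiB                       # RTX Pro 6000
weights_and_overhead = 40 * GiB       # hypothetical: 4-bit weights + activations
max_tokens = (vram - weights_and_overhead) // per_token
print(max_tokens)  # roughly 120k, in the ballpark of the reported ~115k
```

By the same math, a model at Qwen3’s ~128KB/token would fit well over 400k tokens in the same leftover VRAM, which is the gap the post is complaining about.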

I was just surprised, is all. The model scales well in vLLM and seems quite smart.

submitted by /u/IngeniousIdiocy