My biggest Issue with the Gemma-4 Models is the Massive KV Cache!!

Reddit r/LocalLLaMA / 4/3/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user reports that running Unsloth’s Gemma-4 31B (UD-Q8) is difficult on a 40GB VRAM setup because the KV cache becomes too large, forcing aggressive KV quantization to fit even at 2K context.
  • They compare performance and usability, saying they can fit Qwen3.5-27B (UD-Q8) at full context without KV quantization, which they find more practical.
  • The user argues that if Gemma-4 requires Q4 model quantization plus Q8 (or lower) KV cache quantization, they would rather use Qwen3.5-27B, since it performs better on benchmarks.
  • They end by asking others for their experiences with Gemma-4, implying ongoing community discussion around KV cache size and real-world deployment constraints for local LLMs.

I mean, I have 40GB of VRAM and I still cannot fit the entire Unsloth Gemma-4-31B-it-UD-Q8 (35GB) even at a 2K context size unless I quantize the KV cache to Q4? WTF? For comparison, I can fit the entire UD-Q8 Qwen3.5-27B at full context without any KV quantization!
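
For anyone sanity-checking the numbers, here's a minimal sketch of the usual KV-cache size estimate (bytes ≈ 2 × layers × KV heads × head dim × context length × bytes per element). The Gemma-4 hyperparameters below are placeholders I made up for illustration, not published figures, so swap in the real config before drawing conclusions:

```python
# Rough KV-cache size estimate for a dense transformer.
# NOTE: the layer/head numbers below are ASSUMPTIONS for illustration only,
# not the actual Gemma-4-31B config.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float) -> float:
    # 2x for keys and values; one cache entry per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical config: 60 layers, 8 KV heads, head_dim 128.
for label, bpe in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    gib = kv_cache_bytes(60, 8, 128, context_len=32_768, bytes_per_elem=bpe) / 2**30
    print(f"{label} KV cache @ 32K context: {gib:.2f} GiB")
```

The takeaway is that KV-cache size scales linearly with context length and with bytes per element, so halving the KV precision (Q8 → Q4) halves the cache, and that's the only lever left once the model weights themselves are fixed.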

If I have to run a Q4 Gemma-4-31B-it-UD with a Q8 KV cache, then I am better off just using Qwen3.5-27B. After all, the latter beats the former in basically all benchmarks.

What's your experience with the Gemma-4 models so far?

submitted by /u/Iory1998