AI Navigate

Qwen 3.5 27B - quantize KV cache or not?

Reddit r/LocalLLaMA / 3/20/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The post discusses the tradeoffs between weight quantization and KV cache quantization for the Qwen 3.5 27B model family, noting mixed guidance online.
  • Some sources suggest that quantizing the K or V cache to q8 does not significantly hurt quality for this architecture.
  • The author currently runs q6_K weight quantization with a bf16 KV cache, which fits roughly an 80k context window, while the documentation reportedly recommends not going below a 128k context window.
  • The author is weighing whether to switch to q4 weight quantization or q8 KV cache quantization to get above the 128k context window.
  • The discussion highlights practical considerations for deploying larger-context LLMs and balancing quantization choices with performance and context length.

I’m getting mixed answers on the tradeoff between weight quantization and KV cache quantization with the Qwen 3.5 model family.

In some sources I read that the architecture of this model is not really negatively affected by q8 quantization of the K or V cache.

I’m currently running q6_K weights with a bf16 KV cache. It fits on my GPU with around an 80k context window. Apparently the documentation suggests not going lower than a 128k context window.
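For a llama.cpp-style runner, the KV cache precision is set separately from the weight quant, so switching the cache to q8 is a flag change rather than a re-quantization. A sketch (the model filename and context size here are illustrative, not from the post):

```shell
# Hypothetical llama.cpp invocation: q6_K weights with an 8-bit K/V cache.
# -c sets the context window; -ctk/-ctv pick the K and V cache types;
# a quantized V cache generally requires flash attention (-fa) in llama.cpp.
./llama-server -m qwen3.5-27b-q6_k.gguf \
  -c 131072 -fa -ctk q8_0 -ctv q8_0
```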

I’m trying to judge the tradeoff between going to q4 weights or a q8 KV cache, either of which would get me above a 128k context window.
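The VRAM side of that tradeoff is easy to estimate: KV cache memory scales linearly with context length and with bytes per cached element, so q8 roughly halves the cache footprint versus bf16 at the same context. A sketch of the arithmetic (the layer/head/dim numbers below are placeholders, not Qwen 3.5 27B's actual config):

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#                 * context_len * bytes_per_element.
def kv_cache_gib(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Return the KV cache footprint in GiB for the given model shape."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30

# bf16 stores 2 bytes/element; llama.cpp's q8_0 stores ~1.06
# (34 bytes per 32-value block). Model shape below is hypothetical.
for ctx in (80_000, 131_072):
    for name, bpe in (("bf16", 2.0), ("q8_0", 34 / 32)):
        print(f"{ctx:>7} ctx, {name}: "
              f"{kv_cache_gib(ctx, 48, 8, 128, bpe):5.1f} GiB")
```

With those placeholder numbers, moving from bf16 to q8_0 frees roughly as much VRAM as the jump from 80k to 128k context costs, which is the shape of the tradeoff the post is asking about.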

Thanks!

submitted by /u/Spicy_mch4ggis