KV cache quantization: ignorance, or malice?

Reddit r/LocalLLaMA / 5/3/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research

Key Points

  • A user running Qwen-3.6 27B FP8 on vLLM with long-context, multi-agent coding workloads reports that KV cache quantization (e.g., to q8) causes subtle but serious failures such as tool-calling issues and degraded reasoning.
  • They claim that keeping the KV cache at 16-bit significantly improves reliability and gives a better speed/reliability balance than quantized settings.
  • The user questions why KV cache quantization is discussed as a serious solution at all, arguing that it may be tolerable for low-stakes chatbot use but seems unacceptable for “real” or high-stakes tasks.
  • They also mention TurboQuant, suggesting that it may introduce an intelligence/performance hit, and ask whether their understanding of the trade-offs is correct.
  • Overall, the post is a request for correction or clarification: the author, a newcomer with a software engineering background, asks when KV cache quantization is actually appropriate.

I run Qwen-3.6 27B FP8 on vLLM for long-horizon agentic coding-harness workloads with a large context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between speed and reliability. I want to raise a particular point of contention about this optimization process. I have an extensive software engineering background but am relatively new to this space, so feel free to correct me if I’m not on the right track.
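For readers unfamiliar with this kind of setup, here is a minimal sketch of what such a deployment might look like with vLLM's offline Python API. The model ID, context length, and sampling values are placeholders assumed for illustration, not the poster's actual configuration:

```python
from vllm import LLM, SamplingParams

# Hypothetical setup mirroring the post: an FP8 Qwen checkpoint sharded
# across two RTX 3090s. The model name and limits below are placeholders.
llm = LLM(
    model="Qwen/Qwen-27B-FP8",     # placeholder repo id, not a real checkpoint
    tensor_parallel_size=2,        # shard across both 3090s
    max_model_len=65536,           # "high context window" (assumed value)
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate(["Refactor this function to remove the global state."], params)
print(outputs[0].outputs[0].text)
```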

It seems like the conventional wisdom is that you shouldn’t quantize the KV cache. In my experience, with my specific workloads, that holds true: at q8, I see many subtle mistakes, tool-calling issues, and just plain bad reasoning. The model performs dramatically better when I pin the cache at 16-bit.
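For concreteness: “q8” is llama.cpp-style naming for an 8-bit KV cache; in vLLM the analogous knob is `kv_cache_dtype`, which defaults to the model’s own 16-bit dtype (`"auto"`) and can be dropped to an FP8 format. A sketch of the two settings being contrasted, with the same placeholder model as above:

```python
from vllm import LLM

# Default: "auto" keeps the KV cache at the model's activation dtype
# (16-bit here), which is what the poster reports pinning for reliability.
llm_full = LLM(model="Qwen/Qwen-27B-FP8", tensor_parallel_size=2,
               kv_cache_dtype="auto")

# Quantized: roughly halves KV memory, fitting longer contexts or more
# concurrent sub-agents, at the accuracy cost described in the post.
# CLI equivalent: vllm serve <model> --kv-cache-dtype fp8
llm_quant = LLM(model="Qwen/Qwen-27B-FP8", tensor_parallel_size=2,
                kv_cache_dtype="fp8")  # fp8_e5m2 / fp8_e4m3 also accepted
```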

So with that in mind, why do I keep seeing people gesture at this like it’s a serious solution? I guess I can see it if it’s just low-stakes chatbot stuff, but why would anyone run anything serious at anything less than a full-sized KV cache? I keep seeing mentions of TurboQuant as well; I haven’t tried it, but from what I understand it comes with an intelligence hit too.

So am I understanding correctly?

submitted by /u/wombweed