I run Qwen-3.6 27B FP8 on vLLM for long-horizon agentic coding harness workloads with large context windows and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance of speed and reliability. I want to raise a particular point of contention about this optimization process. I have an extensive software engineering background but am relatively new to this, so feel free to correct me if I’m off track.
It seems like the conventional wisdom is that you shouldn’t quantize the KV cache. In my experience, with my specific workloads, that holds: at q8 I see lots of subtle mistakes, tool-calling issues, and just plain bad reasoning. Output quality is dramatically better when I pin the cache at 16-bit.
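For anyone who wants intuition for why 8-bit KV hurts: here’s a toy sketch (not vLLM’s actual kernel, and the tensor shapes/values are made up) round-tripping a synthetic "key" tensor through symmetric per-tensor int8 quantization. The error per element is bounded by half the quantization step, but that step is set by the tensor’s largest outlier, so every attention score absorbs a bit of noise on every step of a long agentic run.

```python
# Toy illustration of lossy 8-bit KV storage (NOT vLLM's real implementation).
import numpy as np

rng = np.random.default_rng(0)
# Pretend this is one attention head's keys: [tokens, head_dim]
keys = rng.normal(0.0, 1.0, size=(128, 64)).astype(np.float32)

# Symmetric per-tensor int8 quantization: scale chosen so the max value maps to 127.
scale = np.abs(keys).max() / 127.0
q = np.clip(np.round(keys / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

# Round-trip error: nonzero, and bounded by half a quantization step.
max_err = np.abs(dequant - keys).max()
print(f"quant step: {scale:.5f}, max round-trip error: {max_err:.5f}")
```

The noise per value is small, but it lands on every key and value at every layer, which is consistent with the "many subtle mistakes" failure mode rather than one catastrophic break.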
So with that in mind, why do I keep seeing people gesture at KV quantization like it’s a serious solution? I can see it for low-stakes chatbot stuff, but why would anyone run anything serious at less than full-size KV? I also keep seeing mentions of TurboQuant. I haven’t tried it, but from what I understand, it seems to come with an intelligence hit too.
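For reference, this is the knob I’m talking about. vLLM exposes the KV cache dtype as a server flag (flag names as of recent vLLM versions; `<model>` is a placeholder, check your version’s docs):

```shell
# Default: KV cache stored at the model's dtype (16-bit here) -- what I run:
vllm serve <model> --kv-cache-dtype auto

# FP8 KV cache -- the setting I'm skeptical of for agentic workloads:
vllm serve <model> --kv-cache-dtype fp8
```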
So am I understanding correctly?