I’m getting mixed answers on the tradeoffs of weight quantization versus KV cache quantization with the Qwen 3.5 model family.
In some sources I read that this model’s architecture is not really negatively affected by Q8 quantization of the K or V cache.
I’m currently running Q6_K weights with a bf16 KV cache. It fits on my GPU with around an 80k context window, but apparently the documentation suggests not going below a 128k context window.
I’m trying to judge the tradeoff between going to Q4 weights or a Q8 KV cache, either of which would get me above a 128k context window.
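For anyone weighing in, here’s my napkin math on the KV cache side. The formula is just 2 (K and V) × layers × KV heads × head dim × bytes per element × context length; the layer/head numbers below are placeholders I picked for illustration, not actual Qwen 3.5 dims:

```python
def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    """Rough KV cache size in GiB. Model dims are made-up placeholders,
    NOT real Qwen 3.5 values -- plug in your model's config."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len
    return total_bytes / (1024 ** 3)

bf16_80k = kv_cache_gib(80_000, bytes_per_elem=2.0)   # bf16 = 2 bytes/element
q8_128k  = kv_cache_gib(128_000, bytes_per_elem=1.0)  # q8 ~ 1 byte/element
print(f"bf16 @  80k: {bf16_80k:.1f} GiB")
print(f"q8   @ 128k: {q8_128k:.1f} GiB")
```

With these placeholder dims, Q8 at 128k actually takes less VRAM than bf16 at 80k, since halving the bytes per element more than offsets the 1.6x longer context. Obviously that comparison shifts with the real model dims.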
Thanks!