Hey everyone,
Ever since Google announced TurboQuant, I've been following the news about its extreme compression with supposedly no noticeable quality degradation. I see it mentioned constantly on this sub, but despite all the discussion, I'm honestly still a bit confused: is it actually usable for us right now? And if so, how?
I recently saw a post where someone applied this TQ quantization directly to the model weights. They got Qwen3.5-27B running at near-Q4_0 quality while shaving off about 10% of the size, which finally let it fit comfortably on a 16GB card (specifically an RTX 5060 Ti). That's huge for those of us on consumer GPUs.
However, since TurboQuant was initially heavily pitched for its efficiency with context and memory, my main question is about the KV Cache.
As we know, context length is the real VRAM killer. So my questions are:
- Can we currently apply TQ quantization to the KV cache when using llama-server (llama.cpp)?
- If yes, how do we enable it? Is there already a CLI flag similar to the existing --cache-type-k q8_0 / --cache-type-v q8_0 (a.k.a. -ctk / -ctv) options?
- Or is this strictly limited to model weights right now, and we are still waiting for an official PR/release from the llama.cpp team to implement TQ for the KV cache?
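For what it's worth, this is how I currently enable the existing (non-TQ) KV cache quants in llama-server; I'm hoping TQ eventually lands behind a similar flag. The flags are real llama.cpp options, but the model path is just a placeholder, and as far as I know the quantized V cache needs flash attention enabled:

```shell
# Existing integer KV cache quantization (not TurboQuant) in llama-server.
# Model path below is a placeholder, not a real file.
llama-server \
  -m ./model.gguf \
  -c 32768 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

So the question is really whether a TQ cache type could slot into those same --cache-type-k / --cache-type-v options.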
I'd love to hear if anyone has tested this or knows the current development status. Thanks!
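P.S. For anyone questioning the "context is the VRAM killer" claim, here's the rough back-of-envelope I use. The layer/head numbers are made-up illustrative GQA dims, not the real Qwen3.5-27B config:

```python
# Rough KV cache size estimate; dims below are hypothetical, not real Qwen numbers.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for the K and V tensors, one pair per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical: 48 layers, 8 KV heads, head_dim 128, 32k context, fp16 (2 B/elem)
fp16 = kv_cache_bytes(48, 8, 128, 32768, 2)
print(f"fp16 KV @ 32k ctx: {fp16 / 2**30:.1f} GiB")        # ~6.0 GiB
print(f"~1 B/elem (q8-ish): {fp16 / 2 / 2**30:.1f} GiB")   # ~3.0 GiB
```

Even at those made-up dims, a long context eats several GiB on top of the weights, which is why a better cache quant matters so much on a 16GB card.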