Hey everyone,
Ever since Google announced TurboQuant, I've been following the news about its extreme compression with supposedly no noticeable quality degradation. I see it mentioned constantly on this sub, but despite all the discussion, I'm honestly still a bit confused: is it actually usable for us right now? And if so, how?
I recently saw a post where someone applied this TQ quantization directly to the model weights. They got Qwen3.5-27B running at near-Q4_0 quality while shaving off about 10% of the size, which finally let it fit comfortably on a 16GB card (specifically an RTX 5060 Ti). That's huge for those of us on consumer GPUs.
However, since TurboQuant was initially heavily pitched for its efficiency with context and memory, my main question is about the KV Cache.
As we know, context length is the real VRAM killer. So my questions are:
- Can we currently apply TQ quantization to the KV cache when using llama-server (llama.cpp)?
- If yes, how do we enable it? Is there already a CLI flag similar to the existing --cache-type-k q8_0 / --cache-type-v q8_0 (a.k.a. -ctk / -ctv) options?
- Or is this strictly limited to model weights right now, and we are still waiting for an official PR/release from the llama.cpp team to implement TQ for the KV cache?
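For what it's worth, this is how I currently enable the existing (non-TQ) KV cache quants in llama-server; I'm hoping TQ eventually lands behind a similar flag. The flags are real llama.cpp options, but the model path is just a placeholder, and as far as I know the quantized V cache needs flash attention enabled:

```shell
# Existing integer KV cache quantization (not TurboQuant) in llama-server.
# Model path below is a placeholder, not a real file.
llama-server \
  -m ./model.gguf \
  -c 32768 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

So the question is really whether a TQ cache type could slot into those same --cache-type-k / --cache-type-v options.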
I'd love to hear if anyone has tested this or knows the current development status. Thanks!
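P.S. For anyone questioning the "context is the VRAM killer" claim, here's the rough back-of-envelope I use. The layer/head numbers are made-up illustrative GQA dims, not the real Qwen3.5-27B config:

```python
# Rough KV cache size estimate; dims below are hypothetical, not real Qwen numbers.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for the K and V tensors, one pair per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical: 48 layers, 8 KV heads, head_dim 128, 32k context, fp16 (2 B/elem)
fp16 = kv_cache_bytes(48, 8, 128, 32768, 2)
print(f"fp16 KV @ 32k ctx: {fp16 / 2**30:.1f} GiB")        # ~6.0 GiB
print(f"~1 B/elem (q8-ish): {fp16 / 2 / 2**30:.1f} GiB")   # ~3.0 GiB
```

Even at those made-up dims, a long context eats several GiB on top of the weights, which is why a better cache quant matters so much on a 16GB card.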