Tested TurboQuant KV compression with Gemma 4 31B — 5.80x compression, perfect long-context recall, JSON output preserved

Reddit r/LocalLLaMA / 4/7/2026


Key Points

  • A quick implementation of Google Research's TurboQuant paper was tested with Google's newly released Gemma 4 31B model, achieving 5.80x KV-cache compression (37.62 MB vs 218.23 MB on the author's FP16 baseline of 222 tokens).
  • The observed compression outperformed the TurboQuant turbo3 theoretical claim (4.9x), which the author attributes to Gemma 4 having fewer outliers in K/V distributions than the synthetic data used in the paper.
  • Functional tests showed “perfect” multi-turn long-context fact recall after a full compress/decompress KV cycle, including a needle-in-haystack retrieval test (~500 tokens) that still returned the hidden phrase.
  • The author reports that agentic JSON extraction remained valid JSON after KV compression, suggesting structured-output capabilities are preserved through the TurboQuant KV-cache bridge.
  • A practical engineering gotcha emerged: Gemma 4 uses heterogeneous KV shapes across layers, requiring per-layer quantizer instances and runtime shape detection to avoid crashes in custom past_key_values interception.

Quick experiment: I implemented Google Research's TurboQuant paper (arXiv 2504.19874) as a Python package and tested it with Google's brand new Gemma 4 31B model. The results exceeded the paper's claims.

Setup:

  • Hardware: RTX PRO 6000 Blackwell (96GB VRAM)
  • Model: google/gemma-4-31B-it (BF16, 64 layers)
  • Compression: TurboQuant turbo3 (3-bit PolarQuant + QJL residual)
  • Backend: HuggingFace Transformers with custom past_key_values interception

Compression results:

| Metric | Value |
|---|---|
| FP16 baseline (222 tokens) | 218.23 MB |
| TurboQuant compressed | 37.62 MB |
| Compression ratio | 5.80x |
| Paper theoretical (turbo3) | 4.9x |

The 5.80x exceeds the paper's 4.9x claim — likely because Gemma 4's architecture has fewer outliers in the K/V distributions than the synthetic data used in the paper.

Functional validation:

  1. Multi-turn fact recall (the critical test): Told Gemma 4 "the secret project code is PHOENIX-42", then asked in turn 2 "what is the secret code?" — after a full TurboQuant compress/decompress cycle of the KV cache. Response: "PHOENIX-42." Perfect.
  2. Long-context needle-in-haystack (~500 tokens): Hid "BLUE-DOLPHIN-7" inside filler text, asked to retrieve it. Response: "The activation phrase mentioned above is BLUE-DOLPHIN-7." Compression during this test: 5.80x.
  3. Agentic JSON output: Asked Gemma 4 to extract entities into structured JSON ({people, places, organizations}) — Gemma 4's signature agentic feature. The structured output remained valid JSON after KV compression. The bridge preserves the model's ability to produce structured output.

Heterogeneous KV shapes:

One interesting finding — Gemma 4 has interleaved attention with different KV head counts across layers (similar to Gemma 3's local/global pattern). My first implementation crashed because it assumed uniform shapes. The fix was per-layer quantizer instances with shape detection at runtime via a dummy forward pass. If anyone else is implementing custom KV cache logic for Gemma 4, watch out for this.
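A minimal sketch of the per-layer fix described above, with illustrative names (`PerLayerQuantizer`, `build_quantizers` are mine, not the package's API). The idea is to read one KV shape per layer off a dummy forward pass's `past_key_values` and key a quantizer to each layer's shape instead of assuming uniformity:

```python
# Hypothetical sketch: per-layer quantizer setup for models whose KV head
# counts differ across layers. Names here are illustrative, not the real API.
from dataclasses import dataclass

@dataclass
class PerLayerQuantizer:
    num_kv_heads: int
    head_dim: int
    # real quantizer state (rotations, codebooks) would live here

def build_quantizers(kv_shapes):
    """kv_shapes: one (batch, num_kv_heads, seq, head_dim) tuple per layer,
    e.g. read off past_key_values after a dummy forward pass."""
    quantizers = []
    for shape in kv_shapes:
        _, num_kv_heads, _, head_dim = shape
        quantizers.append(PerLayerQuantizer(num_kv_heads, head_dim))
    return quantizers

# Simulated shapes for an interleaved local/global attention pattern
# (head counts are made up for illustration):
shapes = [(1, 4, 222, 128) if i % 2 == 0 else (1, 8, 222, 128)
          for i in range(64)]
quantizers = build_quantizers(shapes)
```

A uniform-shape implementation crashes on the first layer whose head count differs; keying quantizers per layer sidesteps that without hardcoding the model's layout.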

How it works (technical):

The paper specifies two algorithms:

  • Algorithm 1 (PolarQuant): Random rotation → Lloyd-Max scalar quantization on Beta-distributed coordinates → inverse rotation. MSE-optimal.
  • Algorithm 2 (QJL residual): Apply Algorithm 1 at b-1 bits, then encode residual with sign(S·r). Inner-product unbiased.

K-cache uses Algorithm 1 (minimizes ||K - K_hat||² → optimal for attention scores Q·K^T). V-cache uses Algorithm 2 (unbiased inner-product → optimal for attention output attn_weights · V).
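To make the rotate → quantize → un-rotate pipeline concrete, here is a simplified sketch of the Algorithm 1 shape, with a plain uniform scalar quantizer standing in for the paper's Lloyd-Max codebook (so the error here is worse than the real thing; all function names are mine):

```python
# Simplified PolarQuant-style sketch: random orthogonal rotation, scalar
# quantization of the normalized coordinates, then inverse rotation.
# A uniform quantizer replaces the paper's Lloyd-Max codebook for brevity.
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Random orthogonal rotation via QR of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(x, bits=3):
    z = x @ R                       # rotate: spreads energy, tames outliers
    norm = np.linalg.norm(z)
    u = z / norm                    # unit-norm coordinates
    levels = 2 ** bits
    scale = np.abs(u).max()
    idx = np.round((u / scale + 1) / 2 * (levels - 1)).astype(np.int64)
    return idx, scale, norm

def dequantize(idx, scale, norm, bits=3):
    levels = 2 ** bits
    u_hat = (idx / (levels - 1) * 2 - 1) * scale
    return (u_hat * norm) @ R.T     # inverse rotation

x = rng.standard_normal(d)
idx, scale, norm = quantize(x)
x_hat = dequantize(idx, scale, norm)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Algorithm 2 would run this at b-1 bits and then store one sign bit per coordinate for the residual, which is what makes the V-path estimator unbiased for inner products.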

The native turboquant-kv library handles the math but stores indices as int64 and signs as float32 — defeating the compression. I added a bit-packing layer on top: 2-bit indices + 1-bit signs packed into bytes, plus float16 norms/gammas. That's where the real 5.80x comes from.
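The packing idea itself is simple and worth showing; this is my own illustrative sketch (function names are mine, not the library's): four 2-bit indices per byte, eight 1-bit signs per byte, so each value costs 3 bits instead of the 64 + 32 bits of the int64/float32 storage:

```python
# Illustrative bit-packing sketch: 2-bit codebook indices (4 per byte) plus
# 1-bit residual signs (8 per byte). Not the package's actual code.
import numpy as np

def pack_2bit(idx):
    idx = idx.astype(np.uint8).reshape(-1, 4)   # assumes len(idx) % 4 == 0
    return idx[:, 0] | (idx[:, 1] << 2) | (idx[:, 2] << 4) | (idx[:, 3] << 6)

def unpack_2bit(packed, n):
    out = np.empty((packed.size, 4), dtype=np.uint8)
    for j in range(4):
        out[:, j] = (packed >> (2 * j)) & 0b11
    return out.reshape(-1)[:n]

def pack_signs(signs):                           # signs in {-1, +1}
    return np.packbits((signs > 0).astype(np.uint8))

def unpack_signs(packed, n):
    bits = np.unpackbits(packed)[:n]
    return bits.astype(np.int8) * 2 - 1

rng = np.random.default_rng(0)
idx = rng.integers(0, 4, size=1024)
signs = rng.choice([-1, 1], size=1024)
idx_rt = unpack_2bit(pack_2bit(idx), 1024)       # round-trips losslessly
signs_rt = unpack_signs(pack_signs(signs), 1024)
```

Both round trips are lossless; the only remaining per-vector overhead is the float16 norms and gammas, which amortize over the head dimension.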

One-line setup:

```
pip install turboagent-ai[torch,native]
```

```python
from turboagent import TurboAgent

agent = TurboAgent("google/gemma-4-31B-it")
response = agent.run("Your prompt here")
```

The package handles backend selection, hardware detection, KV bridge plumbing, and the bit-packing automatically.


MIT licensed. Honest feedback welcome — particularly interested in hearing if anyone has tried this on other recent models or spotted issues with the bit-packing approach.

submitted by /u/No_Appearance_3041