Tested TurboQuant KV compression with Gemma 4 31B — 5.80x compression, perfect long-context recall, JSON output preserved

Reddit r/LocalLLaMA / 4/7/2026


Key Points

  • A quick implementation of Google Research's TurboQuant paper was tested with Google's newly released Gemma 4 31B model, achieving 5.80x KV-cache compression (37.62 MB vs 218.23 MB on the author's FP16 baseline of 222 tokens).
  • The observed compression outperformed the TurboQuant turbo3 theoretical claim (4.9x), which the author attributes to Gemma 4 having fewer outliers in K/V distributions than the synthetic data used in the paper.
  • Functional tests showed “perfect” multi-turn long-context fact recall after a full compress/decompress KV cycle, including a needle-in-haystack retrieval test (~500 tokens) that still returned the hidden phrase.
  • The author reports that agentic JSON extraction remained valid JSON after KV compression, suggesting structured-output capabilities are preserved through the TurboQuant KV-cache bridge.
  • A practical engineering gotcha emerged: Gemma 4 uses heterogeneous KV shapes across layers, requiring per-layer quantizer instances and runtime shape detection to avoid crashes in custom past_key_values interception.

Quick experiment: I implemented Google Research's TurboQuant paper (arXiv 2504.19874) as a Python package and tested it with Google's brand new Gemma 4 31B model. The results exceeded the paper's claims.

Setup:

  • Hardware: RTX PRO 6000 Blackwell (96GB VRAM)
  • Model: google/gemma-4-31B-it (BF16, 64 layers)
  • Compression: TurboQuant turbo3 (3-bit PolarQuant + QJL residual)
  • Backend: HuggingFace Transformers with custom past_key_values interception

Compression results:

| Metric | Value |
|---|---|
| FP16 baseline (222 tokens) | 218.23 MB |
| TurboQuant compressed | 37.62 MB |
| Compression ratio | 5.80x |
| Paper theoretical (turbo3) | 4.9x |

The 5.80x exceeds the paper's 4.9x claim — likely because Gemma 4's architecture has fewer outliers in the K/V distributions than the synthetic data used in the paper.

Functional validation:

  1. Multi-turn fact recall (the critical test): Told Gemma 4 "the secret project code is PHOENIX-42", then asked in turn 2 "what is the secret code?" — after a full TurboQuant compress/decompress cycle of the KV cache. Response: "PHOENIX-42." Perfect.
  2. Long-context needle-in-haystack (~500 tokens): Hid "BLUE-DOLPHIN-7" inside filler text, asked to retrieve it. Response: "The activation phrase mentioned above is BLUE-DOLPHIN-7." Compression during this test: 5.80x.
  3. Agentic JSON output: Asked Gemma 4 to extract entities into structured JSON ({people, places, organizations}) — Gemma 4's signature agentic feature. The structured output remained valid JSON after KV compression. The bridge preserves the model's ability to produce structured output.

Heterogeneous KV shapes:

One interesting finding — Gemma 4 has interleaved attention with different KV head counts across layers (similar to Gemma 3's local/global pattern). My first implementation crashed because it assumed uniform shapes. The fix was per-layer quantizer instances with shape detection at runtime via a dummy forward pass. If anyone else is implementing custom KV cache logic for Gemma 4, watch out for this.
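A minimal sketch of the per-layer fix described above, with illustrative names (`PerLayerQuantizer`, `build_quantizers` are mine, not the package's API). The idea is to read one KV shape per layer off a dummy forward pass's `past_key_values` and key a quantizer to each layer's shape instead of assuming uniformity:

```python
# Hypothetical sketch: per-layer quantizer setup for models whose KV head
# counts differ across layers. Names here are illustrative, not the real API.
from dataclasses import dataclass

@dataclass
class PerLayerQuantizer:
    num_kv_heads: int
    head_dim: int
    # real quantizer state (rotations, codebooks) would live here

def build_quantizers(kv_shapes):
    """kv_shapes: one (batch, num_kv_heads, seq, head_dim) tuple per layer,
    e.g. read off past_key_values after a dummy forward pass."""
    quantizers = []
    for shape in kv_shapes:
        _, num_kv_heads, _, head_dim = shape
        quantizers.append(PerLayerQuantizer(num_kv_heads, head_dim))
    return quantizers

# Simulated shapes for an interleaved local/global attention pattern
# (head counts are made up for illustration):
shapes = [(1, 4, 222, 128) if i % 2 == 0 else (1, 8, 222, 128)
          for i in range(64)]
quantizers = build_quantizers(shapes)
```

A uniform-shape implementation crashes on the first layer whose head count differs; keying quantizers per layer sidesteps that without hardcoding the model's layout.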

How it works (technical):

The paper specifies two algorithms:

  • Algorithm 1 (PolarQuant): Random rotation → Lloyd-Max scalar quantization on Beta-distributed coordinates → inverse rotation. MSE-optimal.
  • Algorithm 2 (QJL residual): Apply Algorithm 1 at b-1 bits, then encode residual with sign(S·r). Inner-product unbiased.

K-cache uses Algorithm 1 (minimizes ||K - K_hat||² → optimal for attention scores Q·K^T). V-cache uses Algorithm 2 (unbiased inner-product → optimal for attention output attn_weights · V).
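To make the rotate → quantize → un-rotate pipeline concrete, here is a simplified sketch of the Algorithm 1 shape, with a plain uniform scalar quantizer standing in for the paper's Lloyd-Max codebook (so the error here is worse than the real thing; all function names are mine):

```python
# Simplified PolarQuant-style sketch: random orthogonal rotation, scalar
# quantization of the normalized coordinates, then inverse rotation.
# A uniform quantizer replaces the paper's Lloyd-Max codebook for brevity.
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Random orthogonal rotation via QR of a Gaussian matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(x, bits=3):
    z = x @ R                       # rotate: spreads energy, tames outliers
    norm = np.linalg.norm(z)
    u = z / norm                    # unit-norm coordinates
    levels = 2 ** bits
    scale = np.abs(u).max()
    idx = np.round((u / scale + 1) / 2 * (levels - 1)).astype(np.int64)
    return idx, scale, norm

def dequantize(idx, scale, norm, bits=3):
    levels = 2 ** bits
    u_hat = (idx / (levels - 1) * 2 - 1) * scale
    return (u_hat * norm) @ R.T     # inverse rotation

x = rng.standard_normal(d)
idx, scale, norm = quantize(x)
x_hat = dequantize(idx, scale, norm)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Algorithm 2 would run this at b-1 bits and then store one sign bit per coordinate for the residual, which is what makes the V-path estimator unbiased for inner products.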

The native turboquant-kv library handles the math but stores indices as int64 and signs as float32 — defeating the compression. I added a bit-packing layer on top: 2-bit indices + 1-bit signs packed into bytes, plus float16 norms/gammas. That's where the real 5.80x comes from.
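The packing idea itself is simple and worth showing; this is my own illustrative sketch (function names are mine, not the library's): four 2-bit indices per byte, eight 1-bit signs per byte, so each value costs 3 bits instead of the 64 + 32 bits of the int64/float32 storage:

```python
# Illustrative bit-packing sketch: 2-bit codebook indices (4 per byte) plus
# 1-bit residual signs (8 per byte). Not the package's actual code.
import numpy as np

def pack_2bit(idx):
    idx = idx.astype(np.uint8).reshape(-1, 4)   # assumes len(idx) % 4 == 0
    return idx[:, 0] | (idx[:, 1] << 2) | (idx[:, 2] << 4) | (idx[:, 3] << 6)

def unpack_2bit(packed, n):
    out = np.empty((packed.size, 4), dtype=np.uint8)
    for j in range(4):
        out[:, j] = (packed >> (2 * j)) & 0b11
    return out.reshape(-1)[:n]

def pack_signs(signs):                           # signs in {-1, +1}
    return np.packbits((signs > 0).astype(np.uint8))

def unpack_signs(packed, n):
    bits = np.unpackbits(packed)[:n]
    return bits.astype(np.int8) * 2 - 1

rng = np.random.default_rng(0)
idx = rng.integers(0, 4, size=1024)
signs = rng.choice([-1, 1], size=1024)
idx_rt = unpack_2bit(pack_2bit(idx), 1024)       # round-trips losslessly
signs_rt = unpack_signs(pack_signs(signs), 1024)
```

Both round trips are lossless; the only remaining per-vector overhead is the float16 norms and gammas, which amortize over the head dimension.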

One-line setup:

```
pip install turboagent-ai[torch,native]
```

```python
from turboagent import TurboAgent

agent = TurboAgent("google/gemma-4-31B-it")
response = agent.run("Your prompt here")
```

The package handles backend selection, hardware detection, KV bridge plumbing, and the bit-packing automatically.


MIT licensed. Honest feedback welcome — particularly interested in hearing if anyone has tried this on other recent models or spotted issues with the bit-packing approach.

submitted by /u/No_Appearance_3041