Quick experiment: I implemented Google Research's TurboQuant paper (arXiv 2504.19874) as a Python package and tested it with Google's brand new Gemma 4 31B model. The results exceeded the paper's claims.
Setup:
- Hardware: RTX PRO 6000 Blackwell (96GB VRAM)
- Model: google/gemma-4-31B-it (BF16, 64 layers)
- Compression: TurboQuant turbo3 (3-bit PolarQuant + QJL residual)
- Backend: HuggingFace Transformers with custom past_key_values interception
Compression results:
| Metric | Value |
|---|---|
| 16-bit baseline (222 tokens) | 218.23 MB |
| TurboQuant compressed | 37.62 MB |
| Compression ratio | 5.80x |
| Paper theoretical (turbo3) | 4.9x |
The 5.80x exceeds the paper's 4.9x claim — likely because Gemma 4's architecture has fewer outliers in the K/V distributions than the synthetic data used in the paper.
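The ratio in the table is easy to sanity-check, including what it implies per stored value:

```python
baseline_mb = 218.23  # 16-bit KV cache for the 222-token conversation
packed_mb = 37.62     # after TurboQuant turbo3 + bit-packing

ratio = baseline_mb / packed_mb
print(f"{ratio:.2f}x")                    # 5.80x
print(f"{16 / ratio:.2f} bits/value")     # effective bits per 16-bit value
```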
Functional validation:
- Multi-turn fact recall (the critical test): Told Gemma 4 "the secret project code is PHOENIX-42", then asked in turn 2 "what is the secret code?" — after a full TurboQuant compress/decompress cycle of the KV cache. Response: "PHOENIX-42." Perfect.
- Long-context needle-in-haystack (~500 tokens): Hid "BLUE-DOLPHIN-7" inside filler text, asked to retrieve it. Response: "The activation phrase mentioned above is BLUE-DOLPHIN-7." Compression during this test: 5.80x.
- Agentic JSON output: Asked Gemma 4 to extract entities into structured JSON ({people, places, organizations}) — Gemma 4's signature agentic feature. The structured output remained valid JSON after KV compression. The bridge preserves the model's ability to produce structured output.
Heterogeneous KV shapes:
One interesting finding — Gemma 4 has interleaved attention with different KV head counts across layers (similar to Gemma 3's local/global pattern). My first implementation crashed because it assumed uniform shapes. The fix was per-layer quantizer instances with shape detection at runtime via a dummy forward pass. If anyone else is implementing custom KV cache logic for Gemma 4, watch out for this.
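The shape of the fix is roughly this (an illustrative sketch, not the actual turboagent code — `PerLayerQuantizers` and the dict "quantizer" are stand-ins):

```python
class PerLayerQuantizers:
    """One quantizer instance per layer, created lazily from the KV shape
    observed at runtime (e.g. via a dummy forward pass), instead of
    assuming one uniform shape across all layers."""

    def __init__(self, make_quantizer):
        self._make = make_quantizer  # factory: (kv_heads, head_dim) -> quantizer
        self._per_layer = {}

    def get(self, layer_idx, kv_shape):
        # kv_shape: (batch, num_kv_heads, seq_len, head_dim)
        if layer_idx not in self._per_layer:
            kv_heads, head_dim = kv_shape[1], kv_shape[3]
            self._per_layer[layer_idx] = self._make(kv_heads, head_dim)
        return self._per_layer[layer_idx]

# Interleaved layers with different KV head counts (Gemma-style):
q = PerLayerQuantizers(lambda heads, dim: {"heads": heads, "dim": dim})
print(q.get(0, (1, 4, 10, 128)))  # {'heads': 4, 'dim': 128}
print(q.get(1, (1, 8, 10, 128)))  # {'heads': 8, 'dim': 128}
```

A global quantizer keyed only on the first layer's shape is exactly what crashes here; lazy per-layer construction sidesteps it without hard-coding the layer pattern.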
How it works (technical):
The paper specifies two algorithms:
- Algorithm 1 (PolarQuant): Random rotation → Lloyd-Max scalar quantization on Beta-distributed coordinates → inverse rotation. MSE-optimal.
- Algorithm 2 (QJL residual): Apply Algorithm 1 at b-1 bits, then encode residual with sign(S·r). Inner-product unbiased.
K-cache uses Algorithm 1 (minimizes ||K - K_hat||² → optimal for attention scores Q·K^T). V-cache uses Algorithm 2 (unbiased inner-product → optimal for attention output attn_weights · V).
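A numpy sketch of the two steps, for intuition only: a uniform mid-rise quantizer stands in for the paper's Lloyd-Max codebook, and a plain Gaussian matrix stands in for the QJL sketch, so don't read exact constants out of this.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Random rotation: QR of a Gaussian matrix gives an orthogonal R.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
a = 3 / np.sqrt(d)  # rotated unit-vector coords concentrate ~N(0, 1/d)

def polarquant(x, bits=3):
    """Algorithm 1 sketch: rotate, scalar-quantize the normalized
    coordinates, rotate back. Uniform quantizer over [-a, a] stands in
    for the Lloyd-Max codebook on the Beta-distributed coordinates."""
    norm = np.linalg.norm(x)
    z = (R @ x) / norm
    levels = 2 ** bits
    idx = np.clip(np.floor((z / a + 1) / 2 * levels), 0, levels - 1)
    z_hat = ((idx + 0.5) / levels * 2 - 1) * a
    return norm * (R.T @ z_hat), idx.astype(np.uint8)

def qjl_residual(x, bits=3):
    """Algorithm 2 sketch: Algorithm 1 at b-1 bits, then keep only
    sign(S @ r) of the residual r. A real codec would seed S so the
    decoder can regenerate it."""
    x_hat, idx = polarquant(x, bits - 1)
    r = x - x_hat
    S = rng.standard_normal((d, d))
    signs = np.sign(S @ r)  # 1 extra bit per coordinate
    return x_hat, idx, signs
```

The split follows from what each cache is multiplied by: K only enters through Q·K^T, so minimizing reconstruction MSE is the right target; V is averaged by the attention weights, so an unbiased inner-product estimator keeps the output unbiased.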
The native turboquant-kv library handles the math but stores indices as int64 and signs as float32 — defeating the compression. I added a bit-packing layer on top: 2-bit indices + 1-bit signs packed into bytes, plus float16 norms/gammas. That's where the real 5.80x comes from.
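A minimal numpy sketch of that packing layer (illustrative, not the actual turboagent code; assumes the index count is divisible by 4):

```python
import numpy as np

def pack_2bit(idx):
    """Pack 2-bit indices four-per-byte: 64 bits (int64) -> 2 bits each."""
    idx = np.asarray(idx, dtype=np.uint8).reshape(-1, 4)
    return idx[:, 0] | (idx[:, 1] << 2) | (idx[:, 2] << 4) | (idx[:, 3] << 6)

def unpack_2bit(packed):
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return out.reshape(-1)

def pack_signs(signs):
    """Pack +/-1 signs one-per-bit: 32 bits (float32) -> 1 bit each."""
    bits = (np.asarray(signs) > 0).astype(np.uint8)
    return np.packbits(bits)

idx = np.array([3, 1, 0, 2, 2, 2, 1, 3], dtype=np.uint8)
assert np.array_equal(unpack_2bit(pack_2bit(idx)), idx)  # lossless round-trip
```

Each index goes from 64 stored bits to 2, and each sign from 32 bits to 1 (32x on both streams), which is why the packing layer, not the quantizer itself, is what realizes the advertised ratio on disk.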
One-line setup:
```bash
pip install "turboagent-ai[torch,native]"
```

```python
from turboagent import TurboAgent

agent = TurboAgent("google/gemma-4-31B-it")
response = agent.run("Your prompt here")
```

The package handles backend selection, hardware detection, KV-bridge plumbing, and the bit-packing automatically.
Links:
- Code: https://github.com/TurboAgentAI/turboagent
- Paper: https://arxiv.org/abs/2504.19874
- Site: https://turboagent.to
MIT licensed. Honest feedback welcome — particularly interested in hearing if anyone has tried this on other recent models or spotted issues with the bit-packing approach.