I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested

Dev.to / 3/28/2026


Key Points

  • Google’s TurboQuant paper (ICLR 2026) proposes compressing transformer KV caches to 4 bits per coordinate with claimed zero accuracy loss, reducing H100 memory usage by about 5–6× on text models like Gemma and Mistral.
  • The author implemented TurboQuant as a vLLM plugin (“turboquant-vllm”) within 72 hours and released it on PyPI, enabling use via vLLM serve without code changes or vLLM forking.
  • Unlike prior compression work focused on token pruning, this article tests TurboQuant in a vision-language, video setting where visual token counts can reach ~11,000 tokens, creating substantially larger KV caches.
  • On an RTX 4090 with Molmo2-4B and ~11K visual tokens, KV cache size drops from 1,639 MiB to 435 MiB (about 3.76×) while maintaining output quality according to the reported comparison.
  • The article positions TurboQuant as complementary to token-pruning approaches and suggests 4-bit KV compression remains viable under the higher token and longer-context pressures of VLM video workloads.

Google published TurboQuant at ICLR 2026 — a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral.
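To make "4 bits per coordinate" concrete, here is a toy NumPy sketch of per-vector 4-bit quantization. This is my own illustration of the general idea, not the paper's actual quantizer (TurboQuant uses a more sophisticated rotation-based scheme): each key/value vector is mapped onto 16 signed levels plus one scale.

```python
import numpy as np

def quantize_4bit(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Map a float vector onto 16 signed levels (-8..7) with one shared scale."""
    scale = float(np.abs(v).max()) / 7.0
    codes = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_4bit(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(128).astype(np.float32)   # one 128-dim KV vector
codes, scale = quantize_4bit(v)
v_hat = dequantize_4bit(codes, scale)
cos = float(v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat)))
print(f"cosine similarity after 4-bit round trip: {cos:.4f}")
```

Even this naive max-abs grid keeps the round-tripped vector closely aligned with the original; the paper's contribution is making that alignment tight enough, at scale, to claim zero accuracy loss.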

I wanted to know: does it work on a vision-language model processing video? On a consumer GPU?

72 hours later, turboquant-vllm is on PyPI.

Quick Start

pip install turboquant-vllm[vllm]
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.

For HuggingFace users:

from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not wrapper) to model.generate()

Why Vision-Language Models Matter

Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second video clip through Molmo2-4B produces ~11,000 visual tokens — 1.6 GB of KV cache on a 24 GB GPU.
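That 1.6 GB is easy to sanity-check with back-of-envelope arithmetic. The token count, layer count, head dimension, and fp16 width come from this post; the KV head count is my illustrative assumption, not Molmo2-4B's published config:

```python
# Back-of-envelope KV cache size for ~11K visual tokens.
tokens     = 11_000
layers     = 36
kv_tensors = 2        # one K and one V cache per layer
kv_heads   = 8        # assumed grouped-query KV head count (illustrative)
head_dim   = 128
bytes_fp16 = 2

total = tokens * layers * kv_tensors * kv_heads * head_dim * bytes_fp16
print(f"{total / 2**20:,.0f} MiB")   # ~1,547 MiB
```

That lands in the same ballpark as the measured 1,639 MiB; the exact figure depends on the true KV head count and the extra text-prompt tokens sharing the cache.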

That's 10x more memory, 10x more opportunities for precision bugs to compound across 36 transformer layers. The existing VLM compression literature (VL-Cache, Dynamic-LLaVA, ZipVL) is all token pruning — deciding which tokens to discard. TurboQuant compresses the tokens you keep. They're complementary approaches, and nobody had validated whether vector quantization survives the visual token regime.

It does.

Results

Molmo2-4B on RTX 4090, 11K visual tokens from a Seinfeld video clip:

| Metric | Baseline | TQ4 Compressed |
| --- | --- | --- |
| KV cache | 1,639 MiB | 435 MiB (3.76x) |
| Output quality | Detailed scene description | Near-identical (100+ tokens match) |
| Decode overhead | 1x (reference) | 1.78x |

Molmo2-8B: same 3.76x ratio, correctly identifies all Seinfeld characters. Full 23-minute episode processed at 24 tok/s.
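Why 3.76x rather than a clean 4x? Per-vector metadata eats into the budget. One consistent accounting, assuming fp16 baselines, 4-bit codes, and one fp32 scale per 128-dim vector (my assumed layout for illustration, not necessarily the plugin's exact format):

```python
head_dim   = 128
bits_fp16  = 16
bits_code  = 4
bits_scale = 32          # one fp32 scale per vector (assumed layout)

bits_per_coord = bits_code + bits_scale / head_dim   # 4.25 bits effective
ratio = bits_fp16 / bits_per_coord
print(f"{ratio:.2f}x")   # 3.76x
```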

What I Built Differently

Plugin, not fork

Other vLLM TurboQuant efforts are forks or monkey-patches. turboquant-vllm uses vLLM's official plugin entry point:

[project.entry-points."vllm.general_plugins"]
tq4_backend = "turboquant_vllm.vllm:register_tq4_backend"

Incremental dequantization

The naive approach decompresses the full KV cache at every layer, every step — 3.36x overhead. Incremental dequantization decompresses only the 1 new token per step and appends to a running buffer. Overhead drops to 1.78x. This isn't in Google's paper.
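The pattern is simple to sketch in NumPy (illustrative only; the plugin does this inside Triton kernels): keep a running dequantized buffer, and at each decode step dequantize only the newly appended token instead of the whole cache.

```python
import numpy as np

class IncrementalKV:
    """Toy per-token 4-bit cache that dequantizes each token exactly once."""

    def __init__(self, head_dim: int = 128):
        self.head_dim = head_dim
        self.codes: list[np.ndarray] = []   # 4-bit codes, one entry per token
        self.scales: list[float] = []
        self.buffer = np.empty((0, head_dim), dtype=np.float32)  # running dequantized KV

    def append(self, v: np.ndarray) -> None:
        scale = float(np.abs(v).max()) / 7.0 or 1.0
        codes = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
        self.codes.append(codes)
        self.scales.append(scale)
        # Incremental step: dequantize only this one new token, O(head_dim) work.
        new_row = codes.astype(np.float32) * scale
        self.buffer = np.vstack([self.buffer, new_row])

    def full(self) -> np.ndarray:
        """Naive alternative: re-dequantize the entire cache from scratch."""
        return np.stack([c.astype(np.float32) * s
                         for c, s in zip(self.codes, self.scales)])

rng = np.random.default_rng(0)
cache = IncrementalKV()
for _ in range(16):                  # simulate 16 decode steps
    cache.append(rng.standard_normal(128).astype(np.float32))
assert np.allclose(cache.buffer, cache.full())   # same values, far less work per step
```

The naive `full()` path redoes work proportional to the whole sequence length at every step; the incremental path does constant work per token, which is where the 3.36x-to-1.78x overhead drop comes from.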

Cross-platform Triton

Fused kernels run on both NVIDIA CUDA and AMD ROCm without code changes. 84/84 GPU tests pass on a Radeon 890M iGPU.

Bugs Nobody Else Has Found

  1. FP16 norms fail at scale. Works at 11,385 tokens, garbles output at 11,397 tokens. The 0.01% error per vector compounds across 36 layers. Always use fp32.

  2. QJL correction is invisible in standard attention. The paper's Stage 2 (2-bit MSE + 1-bit QJL) wastes 1 bit of precision — standard Q @ K^T can't use the correction. Full 3-bit MSE produces identical output.

  3. Multi-layer precision drift in fused kernels. A 0.023 cosine gap per layer between fp32 Triton and bf16 SDPA compounds to produce "pizza pizza pizza" at 36 layers. Flash Attention-style fusion needed.
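A quick way to see why a "small" per-layer gap is fatal: if each of the 36 layers preserves only a 0.977 cosine alignment with the reference (the 0.023 gap per layer measured above), and we treat the gaps as multiplicative (a simplification, not a formal error model):

```python
per_layer_cos = 1 - 0.023    # per-layer cosine gap from the fp32-vs-bf16 comparison
layers = 36
compounded = per_layer_cos ** layers
print(f"{compounded:.2f}")   # ~0.43 alignment after 36 layers
```

An alignment that low at the final layer is more than enough to replace a coherent scene description with "pizza pizza pizza".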

Validation

  • 180+ tests, 9 test files, 95%+ coverage
  • 16 GPU experiments with documented failures
  • Cross-platform: NVIDIA RTX 4090 + AMD Radeon 890M
  • End-to-end: installed from PyPI into stock vllm/vllm-openai:latest container

What's Next

  • Upstream contribution to vLLM (issue #38171, 49 upvotes)
  • Full Flash Attention fusion for the fused Triton kernels
  • Stacking with VL-Cache-style token pruning for multiplicative VLM savings

PyPI | Docs | GitHub
