I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested

Dev.to / 3/28/2026


Key Points

  • Google’s TurboQuant paper (ICLR 2026) proposes compressing transformer KV caches to 4 bits per coordinate with claimed zero accuracy loss, reducing H100 memory usage by about 5–6× on text models like Gemma and Mistral.
  • The author implemented TurboQuant as a vLLM plugin (“turboquant-vllm”) within 72 hours and released it on PyPI, enabling use via vLLM serve without code changes or vLLM forking.
  • Unlike prior compression work focused on token pruning, this article tests TurboQuant in a vision-language, video setting where visual token counts can reach ~11,000 tokens, creating substantially larger KV caches.
  • On an RTX 4090 with Molmo2-4B and ~11K visual tokens, KV cache size drops from 1,639 MiB to 435 MiB (about 3.76×) while maintaining output quality according to the reported comparison.
  • The article positions TurboQuant as complementary to token-pruning approaches and suggests 4-bit KV compression remains viable under the higher token and longer-context pressures of VLM video workloads.

Google published TurboQuant at ICLR 2026 — a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral.
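To make "4 bits per coordinate" concrete, here is a toy NumPy sketch of per-vector 4-bit quantization. This is my own illustration of the general idea, not the paper's actual quantizer (TurboQuant uses a more sophisticated rotation-based scheme): each key/value vector is mapped onto 16 signed levels plus one scale.

```python
import numpy as np

def quantize_4bit(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Map a float vector onto 16 signed levels (-8..7) with one shared scale."""
    scale = float(np.abs(v).max()) / 7.0
    codes = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_4bit(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(128).astype(np.float32)   # one 128-dim KV vector
codes, scale = quantize_4bit(v)
v_hat = dequantize_4bit(codes, scale)
cos = float(v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat)))
print(f"cosine similarity after 4-bit round trip: {cos:.4f}")
```

Even this naive max-abs grid keeps the round-tripped vector closely aligned with the original; the paper's contribution is making that alignment tight enough, at scale, to claim zero accuracy loss.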

I wanted to know: does it work on a vision-language model processing video? On a consumer GPU?

72 hours later, turboquant-vllm is on PyPI.

Quick Start

pip install turboquant-vllm[vllm]
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.

For HuggingFace users:

from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not wrapper) to model.generate()

Why Vision-Language Models Matter

Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second video clip through Molmo2-4B produces ~11,000 visual tokens — 1.6 GB of KV cache on a 24 GB GPU.
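That 1.6 GB is easy to sanity-check with back-of-envelope arithmetic. The token count, layer count, head dimension, and fp16 width come from this post; the KV head count is my illustrative assumption, not Molmo2-4B's published config:

```python
# Back-of-envelope KV cache size for ~11K visual tokens.
tokens     = 11_000
layers     = 36
kv_tensors = 2        # one K and one V cache per layer
kv_heads   = 8        # assumed grouped-query KV head count (illustrative)
head_dim   = 128
bytes_fp16 = 2

total = tokens * layers * kv_tensors * kv_heads * head_dim * bytes_fp16
print(f"{total / 2**20:,.0f} MiB")   # ~1,547 MiB
```

That lands in the same ballpark as the measured 1,639 MiB; the exact figure depends on the true KV head count and the extra text-prompt tokens sharing the cache.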

That's 10x more memory, 10x more opportunities for precision bugs to compound across 36 transformer layers. The existing VLM compression literature (VL-Cache, Dynamic-LLaVA, ZipVL) is all token pruning — deciding which tokens to discard. TurboQuant compresses the tokens you keep. They're complementary approaches, and nobody had validated whether vector quantization survives the visual token regime.

It does.

Results

Molmo2-4B on RTX 4090, 11K visual tokens from a Seinfeld video clip:

| Metric | Baseline | TQ4 Compressed |
| --- | --- | --- |
| KV cache | 1,639 MiB | 435 MiB (3.76x) |
| Output quality | Detailed scene description | Near-identical (100+ tokens match) |
| Decode overhead | 1x (reference) | 1.78x |

Molmo2-8B: same 3.76x ratio, correctly identifies all Seinfeld characters. Full 23-minute episode processed at 24 tok/s.
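Why 3.76x rather than a clean 4x? Per-vector metadata eats into the budget. One consistent accounting, assuming fp16 baselines, 4-bit codes, and one fp32 scale per 128-dim vector (my assumed layout for illustration, not necessarily the plugin's exact format):

```python
head_dim   = 128
bits_fp16  = 16
bits_code  = 4
bits_scale = 32          # one fp32 scale per vector (assumed layout)

bits_per_coord = bits_code + bits_scale / head_dim   # 4.25 bits effective
ratio = bits_fp16 / bits_per_coord
print(f"{ratio:.2f}x")   # 3.76x
```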

What I Built Differently

Plugin, not fork

Other vLLM TurboQuant efforts are forks or monkey-patches. turboquant-vllm uses vLLM's official plugin entry point:

[project.entry-points."vllm.general_plugins"]
tq4_backend = "turboquant_vllm.vllm:register_tq4_backend"

Incremental dequantization

The naive approach decompresses the full KV cache at every layer, every step — 3.36x overhead. Incremental dequantization decompresses only the 1 new token per step and appends to a running buffer. Overhead drops to 1.78x. This isn't in Google's paper.
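The pattern is simple to sketch in NumPy (illustrative only; the plugin does this inside Triton kernels): keep a running dequantized buffer, and at each decode step dequantize only the newly appended token instead of the whole cache.

```python
import numpy as np

class IncrementalKV:
    """Toy per-token 4-bit cache that dequantizes each token exactly once."""

    def __init__(self, head_dim: int = 128):
        self.head_dim = head_dim
        self.codes: list[np.ndarray] = []   # 4-bit codes, one entry per token
        self.scales: list[float] = []
        self.buffer = np.empty((0, head_dim), dtype=np.float32)  # running dequantized KV

    def append(self, v: np.ndarray) -> None:
        scale = float(np.abs(v).max()) / 7.0 or 1.0
        codes = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
        self.codes.append(codes)
        self.scales.append(scale)
        # Incremental step: dequantize only this one new token, O(head_dim) work.
        new_row = codes.astype(np.float32) * scale
        self.buffer = np.vstack([self.buffer, new_row])

    def full(self) -> np.ndarray:
        """Naive alternative: re-dequantize the entire cache from scratch."""
        return np.stack([c.astype(np.float32) * s
                         for c, s in zip(self.codes, self.scales)])

rng = np.random.default_rng(0)
cache = IncrementalKV()
for _ in range(16):                  # simulate 16 decode steps
    cache.append(rng.standard_normal(128).astype(np.float32))
assert np.allclose(cache.buffer, cache.full())   # same values, far less work per step
```

The naive `full()` path redoes work proportional to the whole sequence length at every step; the incremental path does constant work per token, which is where the 3.36x-to-1.78x overhead drop comes from.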

Cross-platform Triton

Fused kernels run on both NVIDIA CUDA and AMD ROCm without code changes. 84/84 GPU tests pass on a Radeon 890M iGPU.

Bugs Nobody Else Has Found

  1. FP16 norms fail at scale. Works at 11,385 tokens, garbles output at 11,397 tokens. The 0.01% error per vector compounds across 36 layers. Always use fp32.

  2. QJL correction is invisible in standard attention. The paper's Stage 2 (2-bit MSE + 1-bit QJL) wastes 1 bit of precision — standard Q @ K^T can't use the correction. Full 3-bit MSE produces identical output.

  3. Multi-layer precision drift in fused kernels. A 0.023 cosine gap per layer between fp32 Triton and bf16 SDPA compounds to produce "pizza pizza pizza" at 36 layers. Flash Attention-style fusion needed.
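A quick way to see why a "small" per-layer gap is fatal: if each of the 36 layers preserves only a 0.977 cosine alignment with the reference (the 0.023 gap per layer measured above), and we treat the gaps as multiplicative (a simplification, not a formal error model):

```python
per_layer_cos = 1 - 0.023    # per-layer cosine gap from the fp32-vs-bf16 comparison
layers = 36
compounded = per_layer_cos ** layers
print(f"{compounded:.2f}")   # ~0.43 alignment after 36 layers
```

An alignment that low at the final layer is more than enough to replace a coherent scene description with "pizza pizza pizza".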

Validation

  • 180+ tests, 9 test files, 95%+ coverage
  • 16 GPU experiments with documented failures
  • Cross-platform: NVIDIA RTX 4090 + AMD Radeon 890M
  • End-to-end: installed from PyPI into stock vllm/vllm-openai:latest container

What's Next

  • Upstream contribution to vLLM (issue #38171, 49 upvotes)
  • Full Flash Attention fusion for the fused Triton kernels
  • Stacking with VL-Cache-style token pruning for multiplicative VLM savings

PyPI | Docs | GitHub
