Implemented TurboQuant in Python over a weekend

Reddit r/LocalLLaMA / 3/30/2026


Key Points

  • The author implemented the paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” in Python over a weekend and shared a corresponding GitHub repo.
  • TurboQuant avoids calibration and training by applying a random rotation to vectors so their coordinates become well-behaved for optimal 1D quantization per dimension.
  • The approach includes a correction for inner products, using a 1-bit Johnson–Lindenstrauss-style residual to reduce bias at low bit rates.
  • The article argues the method is especially practical for transformer KV caches (online/streaming quantization) and for vector databases/embeddings where vectors can be compressed independently.
  • The implementation notes highlight that the random rotation is computationally expensive (O(d^3)) and that the author's implementation is in NumPy, without fractional-bit channel splitting.

Spent ~2 days implementing this paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Repo: github.com/yashkc2025/turboquant

Most quantization stuff I’ve worked with usually falls into one of these:

  • you need calibration data (k-means, clipping ranges, etc.)
  • or you go naive (uniform quant) and take the quality hit (minimal baseline below)
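
For context, the "naive" option is just a fixed symmetric uniform grid. A minimal sketch of what I mean (my own example, not from the repo; the hard-coded clip range is exactly the thing you'd otherwise have to calibrate):

```python
import numpy as np

def uniform_quant(x, bits=4, clip=1.0):
    # fixed symmetric grid; if your data doesn't fit the hard-coded clip
    # range, this is where the quality hit comes from
    n = 2 ** (bits - 1)
    step = clip / n
    return np.clip(np.round(x / step), -n, n - 1) * step
```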

This paper basically says: what if we just… don’t do either?

The main idea is weirdly simple:

  • take your vector
  • hit it with a random rotation
  • now suddenly the coordinates behave nicely (like ~Gaussian-ish)
  • so you can just do optimal 1D quantization per dimension

No training. No dataset-specific tuning. Same quantizer works everywhere.
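
Roughly what that looks like in code. This is a minimal sketch of the idea as I understand it, not the repo's actual code; the 2-bit grid is just the standard Lloyd-Max-style levels for a unit Gaussian, rescaled for unit-norm inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR of a Gaussian matrix gives a random orthogonal matrix
    # (building it is the O(d^3) step mentioned in the notes below)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def encode(x, Q, levels):
    z = Q @ x                                              # rotated coords look ~Gaussian
    return np.argmin(np.abs(z[:, None] - levels), axis=1)  # nearest level, per dimension

def decode(codes, Q, levels):
    return Q.T @ levels[codes]                             # rotate back

d = 256
Q = random_rotation(d)
# 2-bit grid: Lloyd-Max levels for N(0,1), scaled because each rotated
# coordinate of a unit-norm vector has std ~1/sqrt(d)
levels = np.array([-1.51, -0.45, 0.45, 1.51]) / np.sqrt(d)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
x_hat = decode(encode(x, Q, levels), Q, levels)
print(np.linalg.norm(x - x_hat))   # distortion, with zero calibration anywhere
```

The same Q and levels get reused for every vector, which is the whole "no calibration" point.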

There’s also a nice fix for inner products:

normal MSE quantization biases dot products (pretty badly at low bits)

so they add a 1-bit JL-style correction on the residual -> makes it unbiased
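
I won't claim this is the paper's exact estimator, but the flavor is easy to show. My own back-of-envelope version, continuing from the sketch above and swapping in a plain Gaussian JL matrix (m and the scaling are my choices): sketch the residual x - x_hat with random signs, then add back a scaled correction that is unbiased for dot products in expectation.

```python
# continuing from the sketch above (encode/decode, Q, levels, d, rng already defined)
m = 64                                                # number of 1-bit measurements (my choice)

x = rng.standard_normal(d); x /= np.linalg.norm(x)
q_vec = rng.standard_normal(d); q_vec /= np.linalg.norm(q_vec)

x_hat = decode(encode(x, Q, levels), Q, levels)
r = x - x_hat                                         # quantization residual
alpha = np.linalg.norm(r) * np.sqrt(np.pi / 2) / m    # makes E[correction] = <r, q>

plain = x_hat @ q_vec                                 # tends to shrink toward 0 at low bits
corrected = []
for _ in range(500):
    G = rng.standard_normal((m, d))                   # fresh Gaussian JL matrix per trial
    signs = np.sign(G @ r)                            # the 1-bit part: only signs are stored
    corrected.append(plain + alpha * (signs @ (G @ q_vec)))

print("true:", x @ q_vec)
print("plain MSE decode:", plain)
print("with 1-bit residual correction (avg):", np.mean(corrected))
```

More sign measurements buy you lower variance; the bias is gone either way.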

Why this is actually useful:

  • KV cache in transformers: you can't calibrate because tokens stream in -> this works online
  • vector DBs / embeddings: compress each vector independently, no preprocessing step (quick sketch below)
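
Concretely, "online" just means the rotation and grid are fixed up front and every new vector gets compressed the moment it shows up. Tiny sketch using the same encode/Q/levels as above (the stream here is synthetic, not real KV-cache plumbing):

```python
# same encode / Q / levels as the sketch above; the "stream" is synthetic
cache_codes = []                                 # what you'd actually keep around
for step in range(1000):                         # e.g. keys/values arriving token by token
    k = rng.standard_normal(d)
    k /= np.linalg.norm(k)
    cache_codes.append(encode(k, Q, levels))     # quantized immediately, independently
# no second pass, no calibration set, nothing to re-fit as the stream grows
```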

What surprised me:

  • the rotation step is doing all the magic
  • after that, everything reduces to a solved 1D problem
  • theory is tight: within ~2.7× of the optimal distortion bound

My implementation notes:

  • works pretty cleanly in numpy
  • rotation is expensive (O(d³))
  • didn’t implement fractional bits (paper does 2.5 / 3.5-bit with channel splitting)

submitted by /u/chhed_wala_kaccha