[P] Implemented TurboQuant in Python

Reddit r/MachineLearning / 3/30/2026


Key Points

  • The post describes an implementation in Python of the paper “TurboQuant,” which performs online vector quantization without calibration data or dataset-specific tuning.
  • TurboQuant’s core method is to apply a random rotation to vectors so their coordinates become well-behaved (approximately Gaussian), enabling near-optimal per-dimension 1D quantization.
  • It also addresses inner-product distortion by adding a 1-bit JL-style correction on the quantization residual to reduce bias at low bit rates.
  • The author highlights practical motivation for settings like transformer KV caches (which can’t be calibrated because tokens arrive online) and vector databases/embeddings (which compress vectors independently).
  • The implementation notes report clean NumPy integration but flag that the random rotation is computationally expensive (O(d^3)), and that the author did not implement the paper’s fractional-bit/channel-splitting variants.

Spent ~2 days implementing this paper: “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate”

Repo: github.com/yashkc2025/turboquant

Most quantization stuff I’ve worked with usually falls into one of these:

  • you need calibration data (k-means, clipping ranges, etc.)
  • or you go naive (uniform quant) and take the quality hit

This paper basically says: what if we just… don’t do either?

The main idea is weirdly simple:

  • take your vector
  • hit it with a random rotation
  • now suddenly the coordinates behave nicely (like ~Gaussian-ish)
  • so you can just do optimal 1D quantization per dimension

No training. No dataset-specific tuning. Same quantizer works everywhere.
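The four steps above can be sketched in a few lines of NumPy. This is my own minimal illustration, not the repo’s actual API: the QR-based rotation sampling, the function names, and the uniform grid (a simpler stand-in for the paper’s optimal 1D quantizer) are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Sample a Haar-random orthogonal matrix via QR of a Gaussian matrix.
Q, R = np.linalg.qr(rng.standard_normal((d, d)))
Q *= np.sign(np.diag(R))  # sign fix so Q is uniformly distributed

def quantize(x, bits=3):
    """Rotate, then scalar-quantize each coordinate on a uniform grid.
    (The paper uses an optimal 1D quantizer; a uniform grid is a simpler
    stand-in for illustration.)"""
    y = Q @ x                               # coordinates now ~Gaussian-ish
    levels = 2 ** bits
    scale = np.abs(y).max() / (levels / 2)
    q = np.clip(np.round(y / scale), -levels // 2, levels // 2 - 1)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return Q.T @ (q * scale)                # decode, then rotate back

x = rng.standard_normal(d)
q, s = quantize(x)
rel_err = np.linalg.norm(x - dequantize(q, s)) / np.linalg.norm(x)
```

Since Q is orthogonal it preserves norms and inner products exactly, so all the distortion comes from the per-dimension 1D step.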

There’s also a nice fix for inner products:

MSE-optimal quantization biases dot products (pretty badly at low bit rates)

so they add a 1-bit JL-style correction on the residual -> makes it unbiased
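To illustrate how a single bit can give an unbiased estimate, here’s a textbook JL-style construction (a sketch of the general idea, not necessarily the paper’s exact scheme): store sign(⟨g, r⟩) for a shared Gaussian vector g plus the residual norm, and use the identity E[g · sign(⟨g, r⟩)] = √(2/π) · r/‖r‖.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

def one_bit_estimate(r, g):
    """Unbiased 1-bit estimate of a residual r, given a shared Gaussian
    sketch vector g and the stored norm ||r||. Relies on the identity
    E[g * sign(<g, r>)] = sqrt(2/pi) * r / ||r||."""
    b = np.sign(g @ r)  # the single transmitted bit (+1 or -1)
    return np.sqrt(np.pi / 2) * np.linalg.norm(r) * b * g

# Averaging over many independent sketches recovers r, showing the
# estimator is unbiased (a single sketch is unbiased but noisy).
r = rng.standard_normal(d)
est = np.mean([one_bit_estimate(r, rng.standard_normal(d))
               for _ in range(20000)], axis=0)
```

Because the correction is unbiased, the error it leaves in a dot-product estimate averages out instead of accumulating as systematic bias.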

Why this is actually useful:

  • KV cache in transformers: you can’t calibrate because tokens stream in -> this works online
  • vector DBs / embeddings: each vector is compressed independently, no preprocessing step

What surprised me:

  • the rotation step is doing all the magic
  • after that, everything reduces to a solved 1D problem
  • theory is tight: within ~2.7× of the optimal distortion bound

My implementation notes:

  • works pretty cleanly in numpy
  • rotation is expensive (O(d³))
  • didn’t implement fractional bits (paper does 2.5 / 3.5-bit with channel splitting)
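On the O(d³) rotation cost: a standard trick for cheapening random rotations (my assumption for illustration — not something the repo does, per the notes above) is a randomized Hadamard transform HD, with H a fast Walsh–Hadamard transform and D a random ±1 diagonal, which runs in O(d log d):

```python
import numpy as np

def hadamard(y):
    """Fast Walsh-Hadamard transform, O(d log d); len(y) must be a power of 2."""
    y = y.copy()
    d, h = len(y), 1
    while h < d:
        y = y.reshape(-1, 2, h)
        top = y[:, 0, :].copy()
        y[:, 0, :] = top + y[:, 1, :]    # butterfly: sums
        y[:, 1, :] = top - y[:, 1, :]    # butterfly: differences
        y = y.reshape(d)
        h *= 2
    return y / np.sqrt(d)                # normalize -> orthogonal transform

rng = np.random.default_rng(2)
d = 64
signs = rng.choice([-1.0, 1.0], size=d)  # random +/-1 diagonal D

def fast_rotate(x):
    return hadamard(signs * x)           # H @ D @ x, an O(d log d) "rotation"

x = rng.standard_normal(d)
y = fast_rotate(x)
```

HD is orthogonal (it preserves norms exactly) and mixes coordinates well in practice, though it’s less "random" than a Haar-random matrix.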
submitted by /u/chhed_wala_kaccha