[P] Implemented TurboQuant in Python

Reddit r/MachineLearning / 3/30/2026


Key Points

  • The post describes an implementation in Python of the paper “TurboQuant,” which performs online vector quantization without calibration data or dataset-specific tuning.
  • TurboQuant’s core method is to apply a random rotation to vectors so their coordinates become well-behaved (approximately Gaussian), enabling near-optimal per-dimension 1D quantization.
  • It also addresses inner-product distortion by adding a 1-bit JL-style correction on the quantization residual to reduce bias at low bit rates.
  • The author highlights practical motivation for settings like transformer KV caches (which can’t be calibrated because tokens arrive online) and vector databases/embeddings (which compress vectors independently).
  • The implementation notes report clean NumPy integration but flag that the random rotation is computationally expensive (O(d^3)), and that the author did not implement the paper’s fractional-bit/channel-splitting variants.

Spent ~2 days implementing this paper: “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate”

Repo: github.com/yashkc2025/turboquant

Most quantization stuff I’ve worked with usually falls into one of these:

  • you need calibration data (k-means, clipping ranges, etc.)
  • or you go naive (uniform quant) and take the quality hit

This paper basically says: what if we just… don’t do either?

The main idea is weirdly simple:

  • take your vector
  • hit it with a random rotation
  • now suddenly the coordinates behave nicely (like ~Gaussian-ish)
  • so you can just do optimal 1D quantization per dimension

No training. No dataset-specific tuning. Same quantizer works everywhere.
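The four steps above can be sketched in a few lines of NumPy. This is my own minimal illustration, not the repo’s actual API: the QR-based rotation sampling, the function names, and the uniform grid (a simpler stand-in for the paper’s optimal 1D quantizer) are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Sample a Haar-random orthogonal matrix via QR of a Gaussian matrix.
Q, R = np.linalg.qr(rng.standard_normal((d, d)))
Q *= np.sign(np.diag(R))  # sign fix so Q is uniformly distributed

def quantize(x, bits=3):
    """Rotate, then scalar-quantize each coordinate on a uniform grid.
    (The paper uses an optimal 1D quantizer; a uniform grid is a simpler
    stand-in for illustration.)"""
    y = Q @ x                               # coordinates now ~Gaussian-ish
    levels = 2 ** bits
    scale = np.abs(y).max() / (levels / 2)
    q = np.clip(np.round(y / scale), -levels // 2, levels // 2 - 1)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return Q.T @ (q * scale)                # decode, then rotate back

x = rng.standard_normal(d)
q, s = quantize(x)
rel_err = np.linalg.norm(x - dequantize(q, s)) / np.linalg.norm(x)
```

Since Q is orthogonal it preserves norms and inner products exactly, so all the distortion comes from the per-dimension 1D step.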

There’s also a nice fix for inner products:

MSE-optimal quantization biases dot products (pretty badly at low bit rates)

so they add a 1-bit JL-style correction on the residual -> makes it unbiased
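To illustrate how a single bit can give an unbiased estimate, here’s a textbook JL-style construction (a sketch of the general idea, not necessarily the paper’s exact scheme): store sign(⟨g, r⟩) for a shared Gaussian vector g plus the residual norm, and use the identity E[g · sign(⟨g, r⟩)] = √(2/π) · r/‖r‖.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

def one_bit_estimate(r, g):
    """Unbiased 1-bit estimate of a residual r, given a shared Gaussian
    sketch vector g and the stored norm ||r||. Relies on the identity
    E[g * sign(<g, r>)] = sqrt(2/pi) * r / ||r||."""
    b = np.sign(g @ r)  # the single transmitted bit (+1 or -1)
    return np.sqrt(np.pi / 2) * np.linalg.norm(r) * b * g

# Averaging over many independent sketches recovers r, showing the
# estimator is unbiased (a single sketch is unbiased but noisy).
r = rng.standard_normal(d)
est = np.mean([one_bit_estimate(r, rng.standard_normal(d))
               for _ in range(20000)], axis=0)
```

Because the correction is unbiased, the error it leaves in a dot-product estimate averages out instead of accumulating as systematic bias.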

Why this is actually useful:

  • KV cache in transformers: you can’t calibrate because tokens stream in -> this works online
  • vector DBs / embeddings: each vector is compressed independently, no preprocessing step

What surprised me:

  • the rotation step is doing all the magic
  • after that, everything reduces to a solved 1D problem
  • theory is tight: within ~2.7× of the optimal distortion bound

My implementation notes:

  • works pretty cleanly in numpy
  • rotation is expensive (O(d³))
  • didn’t implement fractional bits (paper does 2.5 / 3.5-bit with channel splitting)
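On the O(d³) rotation cost: a standard trick for cheapening random rotations (my assumption for illustration — not something the repo does, per the notes above) is a randomized Hadamard transform HD, with H a fast Walsh–Hadamard transform and D a random ±1 diagonal, which runs in O(d log d):

```python
import numpy as np

def hadamard(y):
    """Fast Walsh-Hadamard transform, O(d log d); len(y) must be a power of 2."""
    y = y.copy()
    d, h = len(y), 1
    while h < d:
        y = y.reshape(-1, 2, h)
        top = y[:, 0, :].copy()
        y[:, 0, :] = top + y[:, 1, :]    # butterfly: sums
        y[:, 1, :] = top - y[:, 1, :]    # butterfly: differences
        y = y.reshape(d)
        h *= 2
    return y / np.sqrt(d)                # normalize -> orthogonal transform

rng = np.random.default_rng(2)
d = 64
signs = rng.choice([-1.0, 1.0], size=d)  # random +/-1 diagonal D

def fast_rotate(x):
    return hadamard(signs * x)           # H @ D @ x, an O(d log d) "rotation"

x = rng.standard_normal(d)
y = fast_rotate(x)
```

HD is orthogonal (it preserves norms exactly) and mixes coordinates well in practice, though it’s less "random" than a Haar-random matrix.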
submitted by /u/chhed_wala_kaccha