Implemented TurboQuant and results don’t fully match paper

Reddit r/LocalLLaMA / 5/3/2026


Key Points

  • The author implemented TurboQuant (arXiv:2504.19874) from scratch and found that their results do not fully replicate the paper, especially for the “PROD” variant.
  • While the MSE-based version achieves compression and distortion behavior broadly as expected, the paper reports over 99% correlation for the PROD version, whereas the author observed about 95.8% correlation at 4-bit.
  • More critically, even with ~95% correlation, attention quality degrades noticeably, dropping to roughly 67% top-1 accuracy in a simple simulation.
  • The author hypothesizes that correlation does not guarantee ranking preservation and that attention is highly sensitive to even small order errors.
  • Implementation details, such as getting variance scaling right (unit vs 1/d), re-deriving the QJL variance scaling, and implementing bit packing so the compression actually materializes, were major practical hurdles, and the author asks for feedback from others familiar with KV cache quantization.

I attempted to implement TurboQuant (arXiv:2504.19874) from scratch over the last few days.

Thought I would check something with folks here since my numbers do not match those in the paper.

Observations:

- MSE version performs well (compression and distortion as expected)
- PROD version:
  - the paper claims over 99% correlation
  - my number sits around 95.8% at 4-bit
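For anyone who wants to sanity-check the kind of numbers I mean: here is a minimal sketch of how I measure distortion and correlation for a quantizer. This uses a generic per-vector 4-bit min-max quantizer as a stand-in, not TurboQuant's actual MSE-optimal or PROD codec, and the data is synthetic Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal((1000, d)).astype(np.float32)

# Per-vector 4-bit uniform (min-max) quantization -- a generic baseline
# stand-in, NOT the paper's quantizer.
lo = x.min(axis=1, keepdims=True)
hi = x.max(axis=1, keepdims=True)
scale = (hi - lo) / 15.0                      # 2^4 - 1 quantization levels
codes = np.round((x - lo) / scale).astype(np.uint8)
x_hat = codes * scale + lo                    # dequantize

mse = np.mean((x - x_hat) ** 2)
# Mean cosine similarity ("correlation") between original and dequantized vectors
cos = np.mean(np.sum(x * x_hat, axis=1) /
              (np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1)))
print(f"MSE: {mse:.5f}  mean cosine: {cos:.4f}")
```

Even this naive baseline gets high cosine similarity on Gaussian data, which is part of why the PROD gap surprised me.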

But what’s more interesting:

- even at this ~95% correlation level, attention quality degrades significantly (only ~67% top-1 accuracy on a simple simulation)

My hypothesis:

- correlation != ranking preservation
- attention is highly sensitive to even small ordering errors among the top scores
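The hypothesis is easy to stress-test with a quick simulation (a sketch, not my exact setup: the Gaussian noise model for quantization error and all sizes here are assumptions). Perturb random keys until the per-vector cosine sits near 0.95, then count how often the argmax of the attention scores survives:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_keys, n_queries = 256, 512, 200

K = rng.standard_normal((n_keys, d))
Q = rng.standard_normal((n_queries, d))

# Perturb keys so per-vector cosine with the original sits near 0.95:
# for k + sigma*eps with Gaussian eps, E[cos] ~ 1/sqrt(1 + sigma^2).
target_cos = 0.95
sigma = np.sqrt(1.0 / target_cos**2 - 1.0)
K_hat = K + sigma * rng.standard_normal(K.shape)

cos = np.mean(np.sum(K * K_hat, axis=1) /
              (np.linalg.norm(K, axis=1) * np.linalg.norm(K_hat, axis=1)))

# Does the top-1 key under exact scores survive the perturbation?
top1_exact = np.argmax(Q @ K.T, axis=1)
top1_pert = np.argmax(Q @ K_hat.T, axis=1)
acc = np.mean(top1_exact == top1_pert)
print(f"mean cosine: {cos:.3f}  top-1 agreement: {acc:.2%}")
```

With many keys the gap between the top two scores is small, so even noise that barely moves the cosine flips the argmax often; the top-1 agreement lands well below the ~95% the cosine number might suggest.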

Other things I ran into:

- variance scaling (unit vs 1/d) initially killed the MSE variant
- QJL variance scaling had to be re-derived
- bit packing is required for the compression to actually materialize
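On the last point: if you leave each 4-bit code in its own byte (or worse, a float), the measured memory footprint never matches the nominal bit rate. A minimal sketch of packing two 4-bit codes per byte (generic, not my repo's exact layout):

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes (values 0..15) into bytes,
    low nibble first."""
    codes = codes.astype(np.uint8).reshape(-1, 2)
    return (codes[:, 0] | (codes[:, 1] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_4bit: recover the original 4-bit codes."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

codes = np.random.default_rng(2).integers(0, 16, size=256, dtype=np.uint8)
packed = pack_4bit(codes)            # 128 bytes instead of 256
restored = unpack_4bit(packed)
```

Only after this step does the stored size actually reflect 4 bits per value.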

Not sure if:

- I am simply missing something in the PROD scaling,
- this is expected behavior at d = 256, or
- the paper's results depend on a larger dimension / setup.

The code is here if anyone is interested in taking a look:

https://github.com/Ashx098/Turboquant-Implementation

Would really appreciate feedback from anyone who has worked on KV cache quantization / similar techniques.

submitted by /u/Routine-Thanks-572