I attempted to implement TurboQuant (arXiv:2504.19874) from scratch over the last few days.
Thought I'd sanity-check my numbers with folks here, since they don't match the ones in the paper.
Observations:
- MSE version performs well (compression & distortion as expected)
- PROD version:
  - the paper reports >99% correlation
  - my number sits around 95.8% at 4-bit
But what's more interesting: even at this ~95% correlation level, attention quality degrades significantly (only ~67% top-1 accuracy in a simple simulation; a rough sketch of the kind of check I mean is below).
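For concreteness, this is roughly the shape of that check, stripped down to a standalone toy. The quantizer here is a plain per-vector uniform 4-bit round-trip, not the PROD scheme, and the sizes/function names are just for illustration, not my actual harness:

```python
import numpy as np

def quantize_dequantize_4bit(x):
    """Per-vector uniform 4-bit round-trip -- a stand-in, NOT the PROD quantizer."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0
    codes = np.clip(np.round((x - lo) / scale), 0, 15)
    return codes * scale + lo

def top1_match_rate(n_queries=1000, n_keys=512, d=256, seed=0):
    """Fraction of queries whose argmax key survives quantization of K."""
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((n_keys, d))
    Q = rng.standard_normal((n_queries, d))
    Kq = quantize_dequantize_4bit(K)
    exact = (Q @ K.T).argmax(axis=1)
    approx = (Q @ Kq.T).argmax(axis=1)
    return (exact == approx).mean()

print(f"top-1 match rate: {top1_match_rate():.3f}")
```

The point is that the metric is argmax agreement rather than score correlation; the ~67% figure above comes from my own simulation, this snippet just shows what "top-1 accuracy" refers to.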
My hypothesis:
- correlation != ranking preservation (toy example below)
- attention is highly sensitive to even small ordering errors
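A toy version of the first point (hand-picked numbers, nothing to do with the paper): two logit vectors can correlate near-perfectly and still disagree on which key wins.

```python
import numpy as np

exact = np.array([5.00, 4.98, 3.00, 1.00, 0.50])  # "true" attention logits
noisy = np.array([4.97, 5.01, 2.90, 1.10, 0.40])  # tiny perturbation
print(np.corrcoef(exact, noisy)[0, 1])  # ~0.999
print(exact.argmax(), noisy.argmax())   # 0 vs 1: the top-1 key flips
```

Softmax then concentrates most of its mass on the wrong key, so a strong correlation number doesn't automatically mean the attention output is preserved.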
Other things I ran into:
- variance scaling (unit vs 1/d) initially killed the MSE variant
- the QJL variance scaling had to be re-derived
- bit packing is required for the compression numbers to actually show up (see the packing sketch after this list)
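On the bit packing point: if each 4-bit code sits in its own int8/float slot, the measured memory never actually drops to 4 bits per value. This is the kind of packing I mean (helper names are mine, just a sketch, assumes an even number of codes):

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack 4-bit codes (values 0..15), two per byte; assumes an even count."""
    codes = codes.astype(np.uint8).reshape(-1, 2)
    return (codes[:, 0] | (codes[:, 1] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_4bit."""
    return np.stack([packed & 0x0F, packed >> 4], axis=1).reshape(-1)

codes = np.random.randint(0, 16, size=256, dtype=np.uint8)
assert np.array_equal(unpack_4bit(pack_4bit(codes)), codes)
```

With that, a d=256 key vector stores as 128 bytes of codes plus whatever per-vector scale/offset metadata the scheme needs.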
Not sure if:
- I'm simply missing something in the PROD scaling,
- this is expected behavior at d=256, or
- the paper's results depend on larger dimensions / a different setup.
The code is here if anyone is interested in taking a look:
https://github.com/Ashx098/Turboquant-Implementation
Would really appreciate feedback from anyone who has worked on KV cache quantization / similar techniques.