Implemented TurboQuant and results don’t fully match paper

Reddit r/LocalLLaMA / 5/3/2026


Key Points

  • The author implemented TurboQuant (arXiv:2504.19874) from scratch and found that their results do not fully replicate the paper, especially for the “PROD” variant.
  • While the MSE-based version achieves compression and distortion behavior broadly as expected, the paper reports over 99% correlation for the PROD version, whereas the author observed about 95.8% correlation at 4-bit.
  • More critically, even with ~95% correlation, attention quality degrades noticeably, dropping to roughly 67% top-1 accuracy in a simple simulation.
  • The author hypothesizes that correlation does not guarantee ranking preservation and that attention is highly sensitive to even small order errors.
  • Implementation details, such as getting variance scaling right (unit vs 1/d), re-deriving the QJL variance scaling, and implementing bit packing so the compression actually materializes, were major practical hurdles, and the author asks for feedback from others familiar with KV cache quantization.

I attempted to implement TurboQuant (arXiv:2504.19874) from scratch over the last few days.

Thought I would check something with folks here since my numbers do not match those in the paper.

Observations:

- MSE version performs well (compression and distortion as expected)
- PROD version:
  - the paper claims over 99% correlation
  - my number sits around 95.8% at 4-bit
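For anyone who wants to sanity-check the kind of numbers I mean: here is a minimal sketch of how I measure distortion and correlation for a quantizer. This uses a generic per-vector 4-bit min-max quantizer as a stand-in, not TurboQuant's actual MSE-optimal or PROD codec, and the data is synthetic Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal((1000, d)).astype(np.float32)

# Per-vector 4-bit uniform (min-max) quantization -- a generic baseline
# stand-in, NOT the paper's quantizer.
lo = x.min(axis=1, keepdims=True)
hi = x.max(axis=1, keepdims=True)
scale = (hi - lo) / 15.0                      # 2^4 - 1 quantization levels
codes = np.round((x - lo) / scale).astype(np.uint8)
x_hat = codes * scale + lo                    # dequantize

mse = np.mean((x - x_hat) ** 2)
# Mean cosine similarity ("correlation") between original and dequantized vectors
cos = np.mean(np.sum(x * x_hat, axis=1) /
              (np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1)))
print(f"MSE: {mse:.5f}  mean cosine: {cos:.4f}")
```

Even this naive baseline gets high cosine similarity on Gaussian data, which is part of why the PROD gap surprised me.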

But what’s more interesting:

- even at this ~95% correlation level, attention quality degrades significantly (only ~67% top-1 accuracy on a simple simulation)

My hypothesis:

- correlation != ranking preservation
- attention is highly sensitive to even small ordering errors among the top scores
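The hypothesis is easy to stress-test with a quick simulation (a sketch, not my exact setup: the Gaussian noise model for quantization error and all sizes here are assumptions). Perturb random keys until the per-vector cosine sits near 0.95, then count how often the argmax of the attention scores survives:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_keys, n_queries = 256, 512, 200

K = rng.standard_normal((n_keys, d))
Q = rng.standard_normal((n_queries, d))

# Perturb keys so per-vector cosine with the original sits near 0.95:
# for k + sigma*eps with Gaussian eps, E[cos] ~ 1/sqrt(1 + sigma^2).
target_cos = 0.95
sigma = np.sqrt(1.0 / target_cos**2 - 1.0)
K_hat = K + sigma * rng.standard_normal(K.shape)

cos = np.mean(np.sum(K * K_hat, axis=1) /
              (np.linalg.norm(K, axis=1) * np.linalg.norm(K_hat, axis=1)))

# Does the top-1 key under exact scores survive the perturbation?
top1_exact = np.argmax(Q @ K.T, axis=1)
top1_pert = np.argmax(Q @ K_hat.T, axis=1)
acc = np.mean(top1_exact == top1_pert)
print(f"mean cosine: {cos:.3f}  top-1 agreement: {acc:.2%}")
```

With many keys the gap between the top two scores is small, so even noise that barely moves the cosine flips the argmax often; the top-1 agreement lands well below the ~95% the cosine number might suggest.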

Other things I ran into:

- variance scaling (unit vs 1/d) initially killed the MSE variant
- QJL variance scaling had to be re-derived
- bit packing is required for the compression to actually materialize
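On the last point: if you leave each 4-bit code in its own byte (or worse, a float), the measured memory footprint never matches the nominal bit rate. A minimal sketch of packing two 4-bit codes per byte (generic, not my repo's exact layout):

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes (values 0..15) into bytes,
    low nibble first."""
    codes = codes.astype(np.uint8).reshape(-1, 2)
    return (codes[:, 0] | (codes[:, 1] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_4bit: recover the original 4-bit codes."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

codes = np.random.default_rng(2).integers(0, 16, size=256, dtype=np.uint8)
packed = pack_4bit(codes)            # 128 bytes instead of 256
restored = unpack_4bit(packed)
```

Only after this step does the stored size actually reflect 4 bits per value.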

Not sure if:

- I am simply missing something in the PROD scaling,
- this is expected behavior at d = 256, or
- the paper's results depend on a larger dimension / setup.

The code is here if anyone is interested in taking a look:

https://github.com/Ashx098/Turboquant-Implementation

Would really appreciate feedback from anyone who has worked on KV cache quantization / similar techniques.

submitted by /u/Routine-Thanks-572