Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant)

Reddit r/LocalLLaMA / 3/27/2026


Key Points

  • The author developing an open-source TurboQuant KV compression implementation in llama.cpp found KV dequantization to be a major bottleneck at long contexts (e.g., ~40% of decode time at 32K on an M5 Max).
  • Instead of optimizing dequant math with LUT/SIMD/fused/branchless approaches, they exploit attention sparsity by skipping V-side dequantization for positions with near-zero softmax weights.
  • This “sparse V dequant” change is described as only a few lines in the kernel and preserves model quality metrics (PPL unchanged; NIAH improved for the tested TurboQuant variant).
  • On Qwen3.5-35B-A3B (M5 Max), TurboQuant KV shows about +22.8% decode speed at 32K versus TurboQuant baseline, while standard q8_0 KV cache sees a smaller but still positive ~+5% speedup.
  • The method is positioned as not TurboQuant-specific, with additional testing on another Apple chip (M2 Pro) and ongoing work/interest in porting and validating on CUDA and other platforms.

I’ve been working on an open source TurboQuant implementation for KV cache compression in llama.cpp and ran into a hard bottleneck: dequantization.

At long context (32K on M5 Max), dequant alone was taking around 40% of decode time.

I tried fixing it the usual way:

- register LUTs
- SIMD tricks
- fused kernels
- branchless math

Tested about 14 different approaches. None beat the baseline. Hardware was already at the limit.

What ended up working was much simpler.

Flash attention computes softmax weights before touching V.
At long context, most of those weights are basically zero.

So instead of making dequant faster, I just skip V dequant entirely for positions with negligible attention.

It’s about 3 lines in the kernel.
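The post doesn't include the kernel code, but the idea can be sketched in plain Python. This is an illustrative mock-up, not the actual llama.cpp change: `attend_value_sparse`, the `dequantize` callback, and the `eps` threshold value are all my own names/assumptions, chosen to show the skip logic only.

```python
def attend_value_sparse(weights, v_quantized, dequantize, eps=1e-4):
    """Accumulate the attention output over cached positions,
    skipping V dequantization wherever the softmax weight is negligible.

    weights      : softmax attention weights, one per cached position
    v_quantized  : quantized V vectors, one per cached position
    dequantize   : function mapping one quantized V vector to floats
    eps          : skip threshold (illustrative value, not the author's)
    """
    out = None
    for w, v_q in zip(weights, v_quantized):
        if w < eps:
            continue  # near-zero attention weight: skip dequant entirely
        v = dequantize(v_q)  # only paid for positions that matter
        if out is None:
            out = [w * x for x in v]
        else:
            out = [o + w * x for o, x in zip(out, v)]
    return out
```

Since the softmax weights are already in hand before V is touched (the flash-attention ordering the post describes), the extra cost is just one compare per position, which is why the real change can be so small.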

Results on Qwen3.5-35B-A3B (M5 Max):

TurboQuant KV (turbo3):

- +22.8% decode at 32K
- PPL unchanged
- NIAH: 7/9 → 9/9

Standard q8_0 KV cache:

- +5% decode
- PPL identical
- NIAH identical

So this is not TurboQuant-specific. It’s using attention sparsity directly.
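A quick toy check of how sparse softmax weights get at long context (synthetic Gaussian logits standing in for real attention scores; the 1e-4 cutoff is an arbitrary choice for illustration):

```python
import math
import random

random.seed(0)
n = 32768  # positions in a 32K context
# synthetic attention logits; real scores are not Gaussian,
# but the concentration effect is the same
logits = [random.gauss(0.0, 3.0) for _ in range(n)]

# numerically stable softmax
m = max(logits)
exps = [math.exp(x - m) for x in logits]
z = sum(exps)
weights = [e / z for e in exps]

negligible = sum(w < 1e-4 for w in weights)
print(f"{negligible}/{n} weights below 1e-4")
```

The vast majority of positions fall under the cutoff, so skipping their V dequant removes most of the per-token dequant work while barely perturbing the output.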

Also tested on M2 Pro:

- 4-mag LUT on K side + sparse V stack cleanly
- turbo3 went from ~0.45x → ~0.73x vs q8_0

Repo and benchmarks:
https://github.com/TheTom/turboquant_plus

Writeup:
https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/sparse-v-dequant.md

If anyone wants to try this on CUDA or other setups I’d be interested to see results.

Note: a CUDA port is currently being tested independently. Will share results once available.

submitted by /u/Pidtom