I’ve been working on an open source TurboQuant implementation for KV cache compression in llama.cpp and ran into a hard bottleneck: dequantization.
At long context (32K on M5 Max), dequant alone was taking around 40% of decode time.
I tried fixing it the usual way:

- register LUTs
- SIMD tricks
- fused kernels
- branchless math
Tested about 14 different approaches. None beat the baseline. Hardware was already at the limit.
What ended up working was much simpler.
Flash attention computes softmax weights before touching V.
At long context, most of those weights are basically zero.
So instead of making dequant faster, I just skip V dequant entirely for positions with negligible attention.
It’s about 3 lines in the kernel.
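
For anyone who wants to see the shape of it, here's a minimal C++ sketch of the skip (not the actual llama.cpp kernel; `dequant_v_row`, `SPARSE_V_THRESH`, and the other names are all illustrative placeholders):

```cpp
#include <cstdint>

// Minimal sketch of sparse V dequantization in a flash-attention
// decode loop. Placeholder names (dequant_v_row, SPARSE_V_THRESH, D)
// are illustrative, not llama.cpp's actual API.

constexpr int   D               = 128;    // head dimension (example)
constexpr float SPARSE_V_THRESH = 1e-4f;  // assumed cutoff; tune per model

// Placeholder: dequantize one quantized V row into dst[D].
void dequant_v_row(const uint8_t * v_quant, int64_t row, float * dst);

void attend_v_sparse(const uint8_t * v_quant,      // quantized V cache
                     const float   * attn_weight,  // softmax weights, already computed
                     int64_t         n_kv,         // number of KV positions
                     float         * out)          // output accumulator [D], zeroed by caller
{
    for (int64_t j = 0; j < n_kv; ++j) {
        const float w = attn_weight[j];

        // The whole trick: the weight is known before V is touched,
        // so the dequant + FMA can be skipped for negligible positions.
        if (w < SPARSE_V_THRESH) {
            continue;
        }

        float v_row[D];
        dequant_v_row(v_quant, j, v_row);

        for (int d = 0; d < D; ++d) {
            out[d] += w * v_row[d];   // weighted accumulation into the output
        }
    }
}
```

The only real knob is the threshold: set it too high and you start dropping attention mass that matters, which is what the PPL and NIAH numbers below are sanity-checking.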
Results on Qwen3.5-35B-A3B (M5 Max):
TurboQuant KV (turbo3):

- +22.8% decode at 32K
- PPL unchanged
- NIAH (needle-in-a-haystack): 7/9 → 9/9
Standard q8_0 KV cache:

- +5% decode
- PPL identical
- NIAH identical
So this is not TurboQuant-specific. It’s using attention sparsity directly.
Also tested on M2 Pro:

- 4-mag LUT on K side + sparse V stack cleanly
- turbo3 went from ~0.45x → ~0.73x vs q8_0
Repo and benchmarks:
https://github.com/TheTom/turboquant_plus
Writeup:
https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/sparse-v-dequant.md
If anyone wants to try this on CUDA or other setups, I'd be interested to see the results.
Note: a CUDA port is currently being tested independently. Will share results once available.