Pure C implementation of the TurboQuant paper (ICLR 2026) for KV cache compression in LLM inference.

Reddit r/LocalLLaMA / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post shares a pure C (with NEON) implementation of the ICLR 2026 TurboQuant approach to KV cache compression, aimed at reducing LLM inference memory costs.
  • Key vectors are compressed to 1 bit per dimension using a randomized Hadamard transform followed by sign hashing, and attention scores are computed with XOR and popcount operations.
  • Value vectors are quantized independently to Q4 or Q2, for a total K+V compression of about 4.9x–7.1x on Gemma 3 4B.
  • Reported results include up to ~3.7 GB of KV cache savings at 32K context, and a 1-bit attention cosine score of 0.634, matching the theoretical limit of 2/pi (≈ 0.637).
  • The implementation is dependency-free and verified with scalar/NEON cross-checks, ASan-clean builds, and 26 test suites; the code is published on GitHub.

Key vectors compressed to 1 bit via randomized Hadamard transform + sign hashing. Attention via XOR + popcount. Values independently quantized to Q4 or Q2. Total K+V: 4.9x–7.1x compression on Gemma 3 4B, saving up to 3.7 GB at 32K context.

1-bit attention cosine = 0.634, matching the 2/pi theoretical limit. All NEON paths verified against scalar reference. ASan clean, 26 test suites. No external dependencies.
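The 2/pi figure is consistent with the classic sign-quantization identity (my reading, not a derivation from the paper): for jointly Gaussian coordinates, which a randomized Hadamard transform approximately produces, the expected product of sign bits depends on the true correlation as

```latex
\mathbb{E}\!\left[\operatorname{sign}(\langle g, u\rangle)\,\operatorname{sign}(\langle g, v\rangle)\right]
  = \frac{2}{\pi}\arcsin\!\big(\langle u, v\rangle\big)
  \approx \frac{2}{\pi}\,\langle u, v\rangle
  \quad \text{for small } \langle u, v\rangle,
```

so one-bit scores track the true attention logits with a slope of 2/pi ≈ 0.637 in the small-correlation regime, which lines up with the reported 0.634 cosine.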

https://github.com/quantumaikr/TurboQuant.cpp

submitted by /u/Suitable-Song-302