RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params)

Reddit r/LocalLLaMA / 3/26/2026


Key Points

  • RotorQuant proposes replacing TurboQuant’s dense random orthogonal matrix with Clifford-algebra rotors (Cl(3,0)) applied via a rotor “sandwich product” on 3D chunks of vectors to reduce compute and parameter count.
  • The method uses a fused CUDA kernel and fused Metal shader implementation that avoids memory round-trips, reportedly achieving 10–19× speedups on NVIDIA RTX PRO 4000 and 9–31× on Apple M4 for Qwen2.5-3B-Instruct KV-cache operations.
  • Reported quality is effectively unchanged versus TurboQuant, with cosine similarity around 0.990 vs 0.991 and “needle-in-haystack” retrieval success at 9/9 across bit-widths.
  • RotorQuant claims 44× fewer parameters (372 vs 16,399 for d=128) and notes a tradeoff: higher synthetic MSE on random unit vectors, mitigated via QJL correction with preserved real-model attention fidelity.

Kinda sounds ridiculous, but I reimagined/reinvented TurboQuant with Clifford-algebra vector quantization, implemented on both CUDA and Metal shaders:

https://github.com/tonbistudio/turboquant-pytorch/pull/4

https://github.com/TheTom/turboquant_plus/pull/34

https://preview.redd.it/mqwnea8iidrg1.png?width=2604&format=png&auto=webp&s=597710bff942ea68180f162ed147e134d33c9639

https://preview.redd.it/n9hjiq6iidrg1.png?width=2652&format=png&auto=webp&s=1ec464ada80dfff65ae7017ab9b834190ace2987

The idea: Replace the d×d random orthogonal matrix Π with Clifford rotors in Cl(3,0). Instead of a dense matmul (16,384 FMAs for d=128), chunk the vector into groups of 3 dims and rotate each with a 4-parameter rotor via the sandwich product RvR̃ (~100 FMAs total).
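The chunked sandwich product can be sketched in NumPy. This is a minimal illustration, not the fused kernel: for pure vectors in Cl(3,0), the rotor sandwich RvR̃ reduces to quaternion rotation, so each 3-dim chunk is rotated by one 4-parameter unit rotor. The function names (`quat_rotate`, `rotor_transform`) and the remainder handling are my own assumptions, not from the repo.

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z).
    For pure vectors this equals the Cl(3,0) rotor sandwich R v R~."""
    w, x, y, z = q
    u = np.array([x, y, z])
    # Rodrigues-style expansion of q v q*: ~18 FMAs per 3-dim chunk
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

def rotor_transform(vec, rotors):
    """Apply one 4-parameter rotor per 3-dim chunk (hypothetical sketch;
    vec length must be 3 * len(rotors))."""
    out = np.empty_like(vec)
    for i, q in enumerate(rotors):
        out[3*i:3*i+3] = quat_rotate(q, vec[3*i:3*i+3])
    return out

rng = np.random.default_rng(0)
d = 126  # nearest multiple of 3 below d=128; real code must handle the remainder dims
rotors = rng.normal(size=(d // 3, 4))
rotors /= np.linalg.norm(rotors, axis=1, keepdims=True)  # normalize to unit rotors
v = rng.normal(size=d)
w = rotor_transform(v, rotors)
print(np.allclose(np.linalg.norm(w), np.linalg.norm(v)))  # True: rotation preserves norm
```

Note the parameter count intuition: 4 floats per chunk instead of a dense d×d matrix, which is where the large parameter reduction comes from.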

Results on Qwen2.5-3B-Instruct KV cache:

- Cosine similarity: 0.990 (vs TurboQuant's 0.991) — effectively identical
- 44× fewer parameters (372 vs 16,399 for d=128)
- Fused CUDA kernel: 10-19× faster than cuBLAS matmul on RTX PRO 4000
- Fused Metal shader: 9-31× faster on Apple M4
- Perfect 9/9 needle-in-haystack at all bit-widths

The key insight: for pure vectors, the rotor sandwich is equivalent to a sparse 3×3 rotation — the fused kernel keeps everything in registers with no memory round-trips, which is why it beats the BLAS GEMM despite TurboQuant's matmul being highly optimized.
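The "sparse 3×3 rotation" equivalence above can be checked numerically: a unit rotor converts to a standard 3×3 rotation matrix, and the sandwich product gives the identical result. A minimal sketch (the conversion formula is the standard quaternion-to-matrix identity, not code from the repo):

```python
import numpy as np

def quat_to_matrix(q):
    """3x3 rotation matrix equivalent to the rotor sandwich R v R~."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

rng = np.random.default_rng(1)
q = rng.normal(size=4)
q /= np.linalg.norm(q)          # unit rotor
R = quat_to_matrix(q)
v = rng.normal(size=3)

# sandwich product via the Rodrigues form
u, w0 = q[1:], q[0]
sandwich = v + 2.0 * np.cross(u, np.cross(u, v) + w0 * v)
print(np.allclose(R @ v, sandwich))  # True: same rotation either way
```

Because each chunk's rotation is independent, a fused kernel can hold the 4 rotor params and 3 vector components in registers for the whole computation, which is the claimed reason it beats a tuned GEMM despite the GEMM's raw FLOP throughput.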

The tradeoff is higher synthetic MSE on random unit vectors (the block-diagonal rotation doesn't induce the exact Beta distribution). But with QJL correction, real-model attention fidelity is identical — and sometimes better on top-1/top-5 retrieval.
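The tradeoff has a simple structural explanation worth demonstrating: a dense random orthogonal matrix mixes every input dimension into every output dimension, while a block-diagonal rotation only mixes within each 3-dim chunk, so the per-coordinate distribution after projection differs from the dense case. A small NumPy demo of that mixing difference (the matrices here are generic QR-based rotations, not RotorQuant's actual rotors):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 12

# Dense random orthogonal matrix (TurboQuant-style): QR of a Gaussian matrix
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Block-diagonal rotation (RotorQuant-style): independent 3x3 rotations per chunk
B = np.zeros((d, d))
for i in range(d // 3):
    blk, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    B[3*i:3*i+3, 3*i:3*i+3] = blk

v = np.zeros(d)
v[0] = 1.0  # a single basis vector
print(np.count_nonzero(np.abs(Q @ v) > 1e-12))  # ~12: spread across all dims
print(np.count_nonzero(np.abs(B @ v) > 1e-12))  # <=3: confined to one chunk
```

This confinement is what breaks the exact Beta distribution on random unit vectors; per the post, the QJL correction recovers real-model attention fidelity despite it.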

Paper: https://www.scrya.com/rotorquant/

Code: https://github.com/scrya-com/rotorquant

PDF: https://www.scrya.com/rotorquant.pdf

submitted by /u/Revolutionary_Ask154