attn-rot (ggerganov's "TurboQuant lite") is on the cusp of getting merged into llama.cpp

Reddit r/LocalLLaMA / 4/1/2026


Key Points

  • Reddit post says attn-rot (ggerganov’s “TurboQuant lite”) is nearing merge into the llama.cpp codebase, implying wider access in a popular local LLM runtime.
  • Reported benchmark results for Qwen3.5-35B-A3B compare attn-rot vs the master branch, showing very similar KLD and top-p behavior across KV quantization levels.
  • The tables indicate little quality regression when using attn-rot under different KV quant schemes (e.g., q8_0 and q4_0) while retaining close output distribution characteristics.
  • Throughput (t/s) measurements are shown for both full precision (bf16) and quantized KV types, suggesting performance remains competitive for VRAM-constrained deployments.
  • The author frames the merge as imminent and plans to delete the post once it lands, making this an early signal for developers tracking llama.cpp improvements.

gonna delete this as soon as it's merged, just couldn't contain my excitement. LOOK AT THAT BENCHIE:
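Numbers like those below are typically produced with llama.cpp's own tools: `llama-bench` for the pp512/tg128 throughput rows and `llama-perplexity`'s KL-divergence mode for the KLD and top-token columns. The OP's exact invocation isn't shown; a hedged sketch, with model path and eval text as placeholders:

```shell
# Throughput: prompt processing of 512 tokens (pp512) and generation of
# 128 tokens (tg128), with the KV cache quantized to q8_0.
llama-bench -m qwen3.5-35b-a3b.gguf -p 512 -n 128 -ctk q8_0 -ctv q8_0

# KL divergence vs. the full-precision run:
# 1) save base logits from the unquantized-KV run,
llama-perplexity -m qwen3.5-35b-a3b.gguf -f eval.txt \
    --kl-divergence-base logits.bin
# 2) re-run with quantized KV and compare against the saved logits.
llama-perplexity -m qwen3.5-35b-a3b.gguf -f eval.txt -ctk q8_0 -ctv q8_0 \
    --kl-divergence-base logits.bin --kl-divergence
```

The second perplexity run prints the mean KLD, its 99th-percentile, and the "same top p" agreement statistics reported in the tables.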

Qwen3.5-35B-A3B (master) fully in VRAM:

| KV quant | mean KLD | 99% KLD | same top p (%) |
|---|---|---|---|
| q8_0 | 0.003778 ± 0.000058 | 0.035869 | 97.303 ± 0.042 |
| q4_0 | 0.010338 ± 0.000085 | 0.078723 | 95.331 ± 0.055 |

| type_k | type_v | test | t/s |
|---|---|---|---|
| bf16 | bf16 | pp512 | 5263.78 ± 23.30 |
| bf16 | bf16 | tg128 | 173.58 ± 0.46 |
| q8_0 | q8_0 | pp512 | 5210.77 ± 124.88 |
| q8_0 | q8_0 | tg128 | 172.11 ± 0.50 |
| q4_0 | q4_0 | pp512 | 5263.64 ± 15.16 |
| q4_0 | q4_0 | tg128 | 171.63 ± 0.66 |

Qwen3.5-35B-A3B (attn-rot) fully in VRAM:

| KV quant | mean KLD | 99% KLD | same top p (%) |
|---|---|---|---|
| q8_0 | 0.003702 ± 0.000039 | 0.035608 | 97.355 ± 0.042 |
| q4_0 | 0.007657 ± 0.000085 | 0.062180 | 96.070 ± 0.051 |

| type_k | type_v | test | t/s |
|---|---|---|---|
| bf16 | bf16 | pp512 | 5270.17 ± 25.16 |
| bf16 | bf16 | tg128 | 173.47 ± 0.19 |
| q8_0 | q8_0 | pp512 | 5231.55 ± 29.73 |
| q8_0 | q8_0 | tg128 | 167.07 ± 0.75 |
| q4_0 | q4_0 | pp512 | 5245.99 ± 21.93 |
| q4_0 | q4_0 | tg128 | 166.47 ± 0.72 |
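For reference, "mean KLD" is the average per-token KL divergence between the full-precision and quantized-KV output distributions, and "same top p" is the percentage of positions where both runs pick the same top token. A minimal illustration over a toy 3-token vocabulary (not the real measurement code):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for one per-token probability distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kv_quant_metrics(base, quant):
    """Mean KLD, and % of positions where the top-1 token agrees."""
    klds = [kl_divergence(p, q) for p, q in zip(base, quant)]
    same_top = sum(p.index(max(p)) == q.index(max(q))
                   for p, q in zip(base, quant))
    return sum(klds) / len(klds), 100.0 * same_top / len(base)

# Toy example: two positions; quantization slightly perturbs the second
# distribution but leaves the top token unchanged.
base  = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
quant = [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]]
mean_kld, same_top_pct = kv_quant_metrics(base, quant)
```

Lower mean KLD and higher same-top-token percentage mean the quantized KV cache is distorting the model's output distribution less, which is what the attn-rot q4_0 rows show relative to master.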

Qwen3.5-27B (master) fully in VRAM:

| KV quant | mean KLD | 99% KLD | same top p (%) |
|---|---|---|---|
| q8_0 | 0.001178 ± 0.000157 | 0.004762 | 98.987 ± 0.026 |
| q4_0 | 0.007168 ± 0.000310 | 0.041270 | 97.021 ± 0.044 |

| type_k | type_v | test | t/s |
|---|---|---|---|
| bf16 | bf16 | pp512 | 2152.75 ± 32.84 |
| bf16 | bf16 | tg128 | 42.84 ± 0.01 |
| q8_0 | q8_0 | pp512 | 2153.43 ± 32.27 |
| q8_0 | q8_0 | tg128 | 42.74 ± 0.01 |
| q4_0 | q4_0 | pp512 | 2152.57 ± 28.21 |
| q4_0 | q4_0 | tg128 | 42.66 ± 0.02 |

Qwen3.5-27B (attn-rot) fully in VRAM:

| KV quant | mean KLD | 99% KLD | same top p (%) |
|---|---|---|---|
| q8_0 | 0.001105 ± 0.000126 | 0.004725 | 98.966 ± 0.026 |
| q4_0 | 0.005305 ± 0.000304 | 0.029281 | 97.604 ± 0.040 |

| type_k | type_v | test | t/s |
|---|---|---|---|
| bf16 | bf16 | pp512 | 2150.84 ± 31.88 |
| bf16 | bf16 | tg128 | 42.85 ± 0.02 |
| q8_0 | q8_0 | pp512 | 2141.86 ± 36.03 |
| q8_0 | q8_0 | tg128 | 42.27 ± 0.03 |
| q4_0 | q4_0 | pp512 | 2138.60 ± 31.63 |
| q4_0 | q4_0 | tg128 | 42.20 ± 0.02 |

Qwen3.5-122B-A10B (master) with --n-cpu-moe=27:

| KV quant | mean KLD | 99% KLD | same top p (%) |
|---|---|---|---|
| q8_0 | 0.003275 ± 0.000027 | 0.039921 | 97.844 ± 0.038 |
| q4_0 | 0.008272 ± 0.000065 | 0.081220 | 96.281 ± 0.049 |

| type_k | type_v | test | t/s |
|---|---|---|---|
| bf16 | bf16 | pp512 | 193.94 ± 54.32 |
| bf16 | bf16 | tg128 | 27.17 ± 0.21 |
| q8_0 | q8_0 | pp512 | 191.27 ± 56.92 |
| q8_0 | q8_0 | tg128 | 27.27 ± 0.11 |
| q4_0 | q4_0 | pp512 | 194.80 ± 55.64 |
| q4_0 | q4_0 | tg128 | 27.22 ± 0.03 |

Qwen3.5-122B-A10B (attn-rot) with --n-cpu-moe=27:

| KV quant | mean KLD | 99% KLD | same top p (%) |
|---|---|---|---|
| q8_0 | 0.003285 ± 0.000027 | 0.039585 | 97.824 ± 0.038 |
| q4_0 | 0.006311 ± 0.000045 | 0.064831 | 96.895 ± 0.045 |

| type_k | type_v | test | t/s |
|---|---|---|---|
| bf16 | bf16 | pp512 | 194.84 ± 56.23 |
| bf16 | bf16 | tg128 | 27.30 ± 0.17 |
| q8_0 | q8_0 | pp512 | 194.10 ± 55.76 |
| q8_0 | q8_0 | tg128 | 27.00 ± 0.10 |
| q4_0 | q4_0 | pp512 | 194.87 ± 56.16 |
| q4_0 | q4_0 | tg128 | 27.21 ± 0.06 |
submitted by /u/Dany0