TurboQuant + TriAttention (C/HIP): ~6.8× total KV cache reduction in llama.cpp

Reddit r/LocalLLaMA / 4/11/2026


Key Points

  • The post reports results from combining two KV-cache reduction techniques in llama.cpp on AMD/HIP: TurboQuant compression (~5.1×) and TriAttention pruning (75% retention, ~1.33×), yielding an estimated ~6.8× total KV-cache reduction and ~1.2 GiB KV at 131K context (from 8.2 GiB f16).
  • TurboQuant-only benchmarks for Qwen3.5-27B on an RX 7900 XTX show strong task performance (e.g., GSM8K 72.0% vs 66% f16, NIAH 28/28 to 64K, and tool calling 26/26) with minimal speed overhead (~1–2%) and small perplexity impact (+0.02% at 4K, -0.9% at 16K).
  • TriAttention is implemented in C/ggml and is based on an NVIDIA/MIT arXiv paper, with pre-built calibration stats included for the Qwen3 family, but the author clarifies that the ~6.8× figure is an arithmetic stack estimate and not fully validated end-to-end for retrieval quality.
  • The author states these are (as far as they know) the only HIP/ROCm TurboQuant implementation in llama.cpp and the only C/ggml TriAttention implementation, and invites additional testing from users on Strix Halo (gfx1201) and RDNA3 (gfx1100).

Edit (2026-04-11): Correction — my NIAH 28/28 results are TurboQuant-only, not the TriAttention combo. The ~6.8× figure is an arithmetic stack estimate (5.12× × 1.33×), not a validated end-to-end retrieval claim. TriAttention integration is promising on the PPL path but not yet validated for retrieval, especially on hybrid architectures. See TheTom's V3 analysis for rigorous testing.

Results from combining two KV-cache reduction methods in llama.cpp on AMD/HIP:

  • TurboQuant KV cache compression (turbo3): ~5.1× reduction
  • TriAttention KV cache pruning (75% retention): ~1.33× reduction
  • Combined: ~6.8× total KV reduction

At 131K context: f16 KV = 8.2 GiB → combo ≈ 1.2 GiB.

TurboQuant numbers (Qwen3.5-27B, RX 7900 XTX):

  • GSM8K: 72.0% on 1319 problems (vs 66% f16)
  • NIAH: 28/28 up to 64K context
  • Tool calling: 26/26
  • PPL: +0.02% at 4K, -0.9% at 16K
  • Speed overhead: ~1-2%

TriAttention is based on the recent NVIDIA/MIT paper (arXiv:2604.04921). My implementation is in C/ggml — no Python needed at runtime. Pre-built calibration stats for Qwen3 family included.

As far as I know, this is currently the only HIP/ROCm TurboQuant implementation for llama.cpp and the only C/ggml implementation of TriAttention.

Repos:

  • TurboQuant (HIP): llama.cpp-turboquant-hip
  • TriAttention (C/ggml): triattention-ggml
  • llama.cpp discussion: #20969

Three users are currently testing on Strix Halo (gfx1201) and RDNA3 (gfx1100). Feedback and testing results welcome.

submitted by /u/Acrobatic_Bee_6660