Edit (2026-04-11): Correction — my NIAH 28/28 results are TurboQuant-only, not the TriAttention combo. The ~6.8× figure is an arithmetic stack estimate (5.12× × 1.33×), not a validated end-to-end retrieval claim. TriAttention integration is promising on the PPL path but not yet validated for retrieval, especially on hybrid architectures. See TheTom's V3 analysis for rigorous testing.
Results from combining two KV-cache reduction methods in llama.cpp on AMD/HIP:
- TurboQuant KV cache compression (turbo3): ~5.12× reduction
- TriAttention KV cache pruning (75% retention): ~1.33× reduction
- Combined: ~6.8× total KV reduction (stacked estimate: 5.12× × 1.33×)
At 131K context: f16 KV = 8.2 GiB → combo ≈ 1.2 GiB.
TurboQuant numbers (Qwen3.5-27B, RX 7900 XTX):
- GSM8K: 72.0% on 1319 problems (vs 66% f16)
- NIAH: 28/28 up to 64K context
- Tool calling: 26/26
- PPL: +0.02% at 4K, -0.9% at 16K
- Speed overhead: ~1-2%
TriAttention is based on the recent NVIDIA/MIT paper (arXiv:2604.04921). My implementation is in C/ggml — no Python needed at runtime. Pre-built calibration stats for Qwen3 family included.
As far as I know, this is currently the only HIP/ROCm TurboQuant implementation for llama.cpp and the only C/ggml implementation of TriAttention.
Repos:
- TurboQuant (HIP): llama.cpp-turboquant-hip
- TriAttention (C/ggml): triattention-ggml
- llama.cpp discussion: #20969
Three users are currently testing on Strix Halo (gfx1201) and RDNA3 (gfx1100). Feedback and testing results welcome.