I’ve been testing TurboQuant this week on two machines and wanted to share the actual numbers.

Why this matters: TurboQuant compresses the KV cache, not the model weights. At long contexts the KV cache can take several GB of memory, so shrinking it makes a real difference even when throughput stays roughly the same. In the setup I tested, K stays at q8_0 and V drops to turbo3 (~3-bit). That asymmetric tradeoff makes sense: errors in the keys affect attention routing more directly, while values usually tolerate heavier compression.

Benchmark 1: Mac Mini M4 16GB, Qwen3-14B Q4_K_M at 8K context
→ Without TurboQuant: KV cache 1280 MiB, K (f16): 640 MiB, V (f16): 640 MiB — 9.95 t/s
→ With TurboQuant: KV cache 465 MiB, K (q8_0): 340 MiB, V (turbo3): 125 MiB — 9.25 t/s

Almost 3x compression, with essentially the same speed.

Benchmark 2: M3 Max 48GB, Qwen3.5 35B A3B Q6 at 128K context
→ Without TurboQuant: KV cache 2560 MiB, K (f16): 1280 MiB, V (f16): 1280 MiB — 45.34 t/s
→ With TurboQuant: KV cache 930 MiB, K (q8_0): 680 MiB, V (turbo3): 250 MiB — 42.88 t/s

How to run it

This uses the community fork by TheTom, which includes Metal kernels for Apple Silicon. It’s not in mainline llama.cpp yet, although PRs are open.

```shell
# Clone the TurboQuant fork (not in mainline llama.cpp yet)
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# Configure with Metal (Apple Silicon GPU)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release

# Compile using all CPU cores
cmake --build build -j$(sysctl -n hw.ncpu)

# Run with TurboQuant: keys at q8_0, values compressed with turbo3.
# Note: the flag names below are an assumption modeled on mainline
# llama.cpp's --cache-type-k / --cache-type-v options; check the
# fork's README for the actual invocation.
./build/bin/llama-server -m <model.gguf> \
  --cache-type-k q8_0 --cache-type-v turbo3
```

Full walkthrough on YouTube soon.
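As a rough sanity check, the reported per-tensor sizes line up with the formats' nominal bit widths. This sketch assumes q8_0 costs about 8.5 bits/element (8-bit values plus per-block scales, as in llama.cpp) and turbo3 lands around 3.1 bits/element; the post doesn't document turbo3's actual layout, so those are assumptions.

```python
# Sanity check: do the reported K/V sizes match the formats' bit widths?
# Assumptions (not from the post): f16 = 16 bits/elem,
# q8_0 ~= 8.5 bits/elem, turbo3 ~= 3.1 bits/elem.
F16, Q8_0, TURBO3 = 16.0, 8.5, 3.1

# Benchmark 1 (M4 Mini): K is 640 MiB at f16, 340 MiB at q8_0;
# V is 640 MiB at f16, 125 MiB at turbo3.
predicted_k = 640 * Q8_0 / F16    # -> 340.0 MiB, matches the reported K size
predicted_v = 640 * TURBO3 / F16  # -> 124.0 MiB, close to the reported 125 MiB
print(predicted_k, predicted_v)
```

The close match suggests the reported numbers are straight format conversions with little per-tensor overhead.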
TurboQuant on Apple Silicon: real benchmarks on Mac Mini M4 16GB and M3 Max 48GB
Reddit r/LocalLLaMA / 4/6/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- TurboQuant reduces only the KV cache (not model weights), which can substantially lower multi-GB memory use during long-context inference while keeping throughput relatively similar.
- On an M4 Mac Mini (16GB) running Qwen3-14B at 8K context, KV cache drops from 1280 MiB to 465 MiB (~3x) with modest speed change (9.95 t/s to 9.25 t/s).
- On an M3 Max (48GB) running Qwen3.5 35B at 128K context, KV cache drops from 2560 MiB to 930 MiB (~3x) with a smaller throughput decrease (45.34 t/s to 42.88 t/s).
- The asymmetric compression setup keeps keys at q8_0 and compresses values to turbo3 (~3-bit), reflecting the author’s view that key precision affects attention routing more directly than value precision.
- The benchmarks use a community fork (TheTom) with Metal kernels for Apple Silicon; it is not in mainline llama.cpp yet, but the author points to open PRs and provides build steps to enable it.
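The "~3x" figure in the bullets can be checked directly from the reported KV-cache totals; both machines come out to the same ratio:

```python
# Overall KV-cache compression ratios from the post's reported totals (MiB)
benchmarks = {
    "M4 Mini, Qwen3-14B @ 8K":    (1280, 465),
    "M3 Max, Qwen3.5 35B @ 128K": (2560, 930),
}
for name, (before, after) in benchmarks.items():
    print(f"{name}: {before / after:.2f}x smaller KV cache")
# -> both cases come out to 2.75x
```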




