TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed)

Reddit r/LocalLLaMA / 3/28/2026

📰 NewsDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical UsageModels & Research

Key Points

  • The article shares both the code repository (turboquant-mlx) and a PR to mlx-lm, suggesting the work is being actively integrated into the MLX ecosystem.

Implemented TurboQuant (Google's new KV cache compression paper) for MLX with fused Metal kernels.

Results on Qwen2.5-32B, M4 Pro 48GB:

- 4.6x compression, 0.98x FP16 speed, identical quality

- 16K context: 4.2GB cache → 897MB

The main challenge was speed — went from 0.28x to 0.98x FP16 through fused Metal quantize/dequantize kernels and an incremental decode buffer.

Writeup with the full optimization journey: https://medium.com/@antonrozanov/turboquant-on-mlx-4-6x-kv-cache-compression-with-custom-metal-kernels-9cdee3f7d2a2

Code: https://github.com/arozanov/turboquant-mlx

PR to mlx-lm: https://github.com/ml-explore/mlx-lm/pull/1067

submitted by /u/dirtyhand3
[link] [comments]