AI Navigate


TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed)

Reddit r/LocalLLaMA / 3/28/2026



Key Points

The post links both the code repository (turboquant-mlx) and a pull request to mlx-lm, which suggests the author is working to upstream the implementation into the MLX ecosystem.

I implemented TurboQuant (the KV cache compression method from Google's recent paper) for MLX with fused Metal kernels.

Results on Qwen2.5-32B, M4 Pro 48GB:

- 4.6x compression, 0.98x FP16 speed, identical quality

- 16K context: 4.2GB cache → 897MB
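
As a rough sanity check on the figures above, FP16 KV cache size can be estimated from the model shape. The sketch below assumes Qwen2.5-32B has 64 layers, 8 KV heads, and a head dimension of 128; these values come from the public model config as I understand it, not from the post, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Estimate KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Assumed Qwen2.5-32B shape (not stated in the post).
fp16_bytes = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, n_tokens=16 * 1024)
print(f"FP16 cache at 16K tokens: {fp16_bytes / 2**30:.2f} GiB")

# Ratio implied by the sizes reported in the post (4.2GB -> 897MB, decimal units).
reported_ratio = 4.2 * 1000 / 897
print(f"Reported compression: {reported_ratio:.2f}x")
```

Under those assumptions the estimate comes to about 4 GiB, which is in the same range as the 4.2GB reported, and the two reported sizes imply a ratio close to the stated 4.6x.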

The main challenge was speed: it went from 0.28x to 0.98x of FP16 speed through fused Metal quantize/dequantize kernels and an incremental decode buffer.
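
The post does not describe how the incremental decode buffer works, so the following is only one plausible reading: instead of dequantizing the whole cache at every decode step, each token is dequantized once when it is appended and the result is reused. This is a pure-Python toy; the class name is hypothetical, and the absmax 4-bit quantizer is a simple stand-in, not the TurboQuant algorithm or the author's Metal kernels.

```python
class IncrementalDequantCache:
    """Toy cache: stores quantized codes and dequantizes each token only once."""

    def __init__(self, levels=15):
        self.levels = levels      # 15 steps, so each code fits in 4 bits
        self.codes = []           # compressed storage: (scale, codes) per token
        self.decoded = []         # running buffer of already-dequantized vectors
        self.dequant_calls = 0    # counts dequantization work for illustration

    def _quantize(self, vec):
        # Stand-in absmax quantizer (an assumption, not TurboQuant).
        scale = max(abs(x) for x in vec) or 1.0
        codes = [round((x / scale + 1) / 2 * self.levels) for x in vec]
        return scale, codes

    def _dequantize(self, scale, codes):
        self.dequant_calls += 1
        return [(c / self.levels * 2 - 1) * scale for c in codes]

    def append(self, vec):
        scale, codes = self._quantize(vec)
        self.codes.append((scale, codes))
        # Only the new token is dequantized; earlier entries are reused as-is.
        self.decoded.append(self._dequantize(scale, codes))

    def view(self):
        return self.decoded


cache = IncrementalDequantCache()
for token_vec in ([0.5, -1.0, 0.25], [0.1, 0.2, -0.3], [2.0, 0.0, -2.0]):
    cache.append(token_vec)
    _ = cache.view()  # what attention would read at this decode step

print(cache.dequant_calls)  # one call per token, not one per token per step
```

A naive version that re-dequantizes every stored token at each step would do work that grows quadratically with sequence length. The toy above also keeps a full decoded copy, which costs memory; how the actual implementation balances that against compression is not stated in the post.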

Writeup with the full optimization journey: https://medium.com/@antonrozanov/turboquant-on-mlx-4-6x-kv-cache-compression-with-custom-metal-kernels-9cdee3f7d2a2

Code: https://github.com/arozanov/turboquant-mlx

PR to mlx-lm: https://github.com/ml-explore/mlx-lm/pull/1067

submitted by /u/dirtyhand3
