Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

arXiv cs.LG / 4/21/2026


Key Points

  • Open-TQ-Metal is a new open-source implementation that brings fused compressed-domain attention to Apple Silicon, enabling 128K-context Llama 3.1 70B inference on a single 64GB consumer Mac.
  • The approach quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation using custom Metal compute shaders, avoiding intermediate dequantization matrices.
  • In 330 experiments across Gemma 4 (31B) and Llama 3.1 (70B), the fused sdpa_int4 kernel delivers a reported 48× attention speedup at 128K context versus a dequantize-then-attend baseline.
  • The method reduces KV cache memory from 40GB to 12.5GB (3.2× compression) while maintaining identical top-1 token predictions compared with FP16 inference.
  • The paper also provides cross-architecture findings on KV cache quantization, arguing that the attention scale factor—rather than model size—governs whether angular quantization schemes (e.g., PolarQuant) succeed.
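The reported memory figures can be sanity-checked with back-of-envelope arithmetic, assuming Llama 3.1 70B's published configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and a hypothetical int4 layout with groups of 32 channels carrying an fp16 scale and fp16 zero point each; the paper does not state the group size, so this layout is an assumption chosen to match the reported numbers:

```python
# Back-of-envelope check of the reported KV-cache sizes. The model config
# is Llama 3.1 70B's published architecture; the int4 group layout
# (group=32, fp16 scale + fp16 zero per group) is an assumption.
layers, kv_heads, head_dim = 80, 8, 128
tokens = 128 * 1024  # 128K context

# FP16 baseline: K and V, 2 bytes per element.
fp16_bytes = layers * kv_heads * head_dim * 2 * 2 * tokens
print(fp16_bytes / 2**30)  # → 40.0 (GiB)

# Int4 with per-group fp16 scale and zero point: effective bits/element.
group = 32
bits_per_elem = 4 + (16 + 16) / group  # 5.0 effective bits
int4_bytes = fp16_bytes * bits_per_elem / 16
print(int4_bytes / 2**30)       # → 12.5 (GiB)
print(fp16_bytes / int4_bytes)  # → 3.2 (compression ratio)
```

Under these assumptions the arithmetic lands exactly on the paper's 40 GB, 12.5 GB, and 3.2× figures.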

Abstract

We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a configuration not supported by any existing inference framework. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves a 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains top-1 token predictions identical to those of FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing that the attention scale factor -- not model size -- determines whether angular quantization schemes such as PolarQuant succeed or fail: Gemma 4's attn_scale=1.0 amplifies directional error 25-100x more than Llama's standard 1/sqrt(d) scaling.