Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

arXiv cs.LG / 4/21/2026


Key Points

  • Open-TQ-Metal is a new open-source implementation that brings fused compressed-domain attention to Apple Silicon, enabling 128K-context Llama 3.1 70B inference on a single 64GB consumer Mac.
  • The approach quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation using custom Metal compute shaders, avoiding intermediate dequantization matrices.
  • In 330 experiments across Gemma 4 (31B) and Llama 3.1 (70B), the fused sdpa_int4 kernel delivers a reported 48× attention speedup at 128K context versus a dequantize-then-attend baseline.
  • The method reduces KV cache memory from 40GB to 12.5GB (3.2× compression) while maintaining identical top-1 token predictions compared with FP16 inference.
  • The paper also provides cross-architecture findings on KV cache quantization, arguing that the attention scale factor—rather than model size—governs whether angular quantization schemes (e.g., PolarQuant) succeed.
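The reported memory figures can be sanity-checked with back-of-envelope arithmetic, assuming Llama 3.1 70B's published configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and a hypothetical int4 layout with groups of 32 channels carrying an fp16 scale and fp16 zero point each; the paper does not state the group size, so this layout is an assumption chosen to match the reported numbers:

```python
# Back-of-envelope check of the reported KV-cache sizes. The model config
# is Llama 3.1 70B's published architecture; the int4 group layout
# (group=32, fp16 scale + fp16 zero per group) is an assumption.
layers, kv_heads, head_dim = 80, 8, 128
tokens = 128 * 1024  # 128K context

# FP16 baseline: K and V, 2 bytes per element.
fp16_bytes = layers * kv_heads * head_dim * 2 * 2 * tokens
print(fp16_bytes / 2**30)  # → 40.0 (GiB)

# Int4 with per-group fp16 scale and zero point: effective bits/element.
group = 32
bits_per_elem = 4 + (16 + 16) / group  # 5.0 effective bits
int4_bytes = fp16_bytes * bits_per_elem / 16
print(int4_bytes / 2**30)       # → 12.5 (GiB)
print(fp16_bytes / int4_bytes)  # → 3.2 (compression ratio)
```

Under these assumptions the arithmetic lands exactly on the paper's 40 GB, 12.5 GB, and 3.2× figures.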

Abstract

We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a configuration not supported by any existing inference framework. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves a 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains top-1 token predictions identical to those of FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing that the attention scale factor -- not model size -- determines whether angular quantization schemes such as PolarQuant succeed or fail: Gemma 4's attn_scale=1.0 amplifies directional error 25-100x more than Llama's standard 1/sqrt(d) scaling.