Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
arXiv cs.LG, April 21, 2026
Key Points
- Open-TQ-Metal is a new open-source implementation that brings fused compressed-domain attention to Apple Silicon, enabling 128K-context Llama 3.1 70B inference on a single 64GB consumer Mac.
- The approach quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation using custom Metal compute shaders, avoiding the materialization of intermediate dequantized matrices.
- In 330 experiments across Gemma 4 (31B) and Llama 3.1 (70B), the fused sdpa_int4 kernel delivers a reported 48× attention speedup at 128K context versus a dequantize-then-attend baseline.
- The method reduces KV cache memory from 40GB to 12.5GB (3.2× compression) while maintaining identical top-1 token predictions compared with FP16 inference.
- The paper also provides cross-architecture findings on KV cache quantization, arguing that the attention scale factor—rather than model size—governs whether angular quantization schemes (e.g., PolarQuant) succeed.
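The on-the-fly int4 KV-cache compression described above can be illustrated with a group-wise asymmetric quantizer. This is a hypothetical NumPy sketch of the general technique, not the paper's Metal kernel; the group size of 32 and the scale/zero-point layout are assumptions, since the digest does not specify them.

```python
import numpy as np

def quantize_int4(kv, group_size=32):
    """Group-wise asymmetric int4 quantization of a KV-cache tensor.

    Illustrative sketch only: each group of `group_size` values gets its
    own scale and zero-point, and values are rounded to 16 levels (0..15).
    """
    flat = kv.reshape(-1, group_size).astype(np.float32)
    lo = flat.min(axis=1, keepdims=True)          # per-group zero-point
    hi = flat.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0      # 16 quantization levels
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_int4(q, scale, lo, shape):
    """Reference decompression; a fused kernel would skip this step
    and consume the int4 codes directly inside attention."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)

# Round-trip a toy "KV cache" block and measure worst-case error.
kv = np.random.randn(4, 64).astype(np.float32)
q, s, z = quantize_int4(kv)
recon = dequantize_int4(q, s, z, kv.shape)
err = np.abs(recon - kv).max()
```

The rounding error is bounded by half a quantization step per element, which is why a fused kernel can operate on the codes without visibly changing top-1 token predictions in many settings.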
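The reported 3.2× compression (40GB to 12.5GB) is consistent with group-wise int4 storage that carries a per-group FP16 scale and zero-point. Assuming a group size of 32 (not stated in the digest), the arithmetic works out exactly:

```python
GROUP = 32                              # assumed quantization group size
fp16_bits = GROUP * 16                  # 512 bits per group in FP16
int4_bits = GROUP * 4 + 16 + 16         # int4 payload + FP16 scale + FP16 zero-point
ratio = fp16_bits / int4_bits           # 512 / 160 = 3.2x compression

kv_fp16_gb = 40.0                       # FP16 KV cache at 128K context
kv_int4_gb = kv_fp16_gb / ratio         # compressed size in GB
```

Note that the metadata overhead is why the ratio is 3.2× rather than the naive 4× of int4 versus FP16.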