TurboAngle: Near-Lossless KV Cache Compression via Uniform Angle Quantization

arXiv cs.LG / 3/31/2026

Key Points

TurboAngle compresses transformer KV caches by quantizing angles in the Fast Walsh-Hadamard domain, where a random diagonal rotation makes consecutive element pairs approximately uniformly distributed on the unit circle.
The method adds an “early-boost” mechanism that independently selects the K and V codebook sizes at each layer, giving higher precision to a model-specific subset of critical layers.
Experiments across seven models (1B–7B parameters) show that per-layer early-boost is lossless on four models and near-lossless on six of the seven, at 3.28–3.67 angle bits per element.
An asymmetric norm quantization variant (8-bit keys, 4-bit log-space values) achieves 6.56 total bits per element on Mistral-7B with a perplexity degradation of only +0.0014 and no calibration data.
A sensitivity analysis identifies model-specific bottleneck patterns, including K-dominated versus V-dominated layers and negative-transfer layers where allocating more precision can worsen quality.
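
To make the first key point concrete, here is a minimal sketch of angle quantization after a randomized Walsh-Hadamard rotation. It is an illustration under stated assumptions, not the paper's implementation: the rotation is taken to be a random ±1 diagonal applied before an orthonormal Fast Walsh-Hadamard transform, pair norms are kept in full precision (the summary says norms are quantized separately), and the function names `fwht`, `random_signs`, `encode`, and `decode` are hypothetical.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def random_signs(n, seed=0):
    """Random +/-1 diagonal (assumed form of the 'random diagonal rotation')."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def encode(x, signs, n_angles):
    """Rotate x, then store an angle index and a norm for each consecutive pair."""
    y = fwht([s * v for s, v in zip(signs, x)])
    step = 2.0 * math.pi / n_angles
    codes, norms = [], []
    for i in range(0, len(y), 2):
        a, b = y[i], y[i + 1]
        theta = math.atan2(b, a)  # angle of the pair, in [-pi, pi]
        codes.append(int(round(theta / step)) % n_angles)  # uniform angle grid
        norms.append(math.hypot(a, b))  # kept unquantized in this sketch
    return codes, norms

def decode(codes, norms, signs, n_angles):
    """Rebuild each pair from its angle index and norm, then undo the rotation."""
    step = 2.0 * math.pi / n_angles
    y = []
    for k, r in zip(codes, norms):
        y.extend((r * math.cos(k * step), r * math.sin(k * step)))
    # The orthonormal FWHT is its own inverse, and a +/-1 diagonal inverts itself.
    return [s * v for s, v in zip(signs, fwht(y))]

# Usage: an 8-dimensional vector with a 128-entry angle codebook.
signs = random_signs(8, seed=1)
x = [0.5, -1.2, 3.0, 0.1, -0.7, 2.2, 0.0, 1.5]
codes, norms = encode(x, signs, n_angles=128)
x_hat = decode(codes, norms, signs, n_angles=128)
```

Because every step except the angle rounding is orthogonal, the reconstruction error in this sketch is bounded by the vector norm times half the angular step (π / n_angles), which is why a larger codebook on a sensitive layer buys precision directly.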

Abstract

We compress KV cache entries by quantizing angles in the Fast Walsh-Hadamard domain, where a random diagonal rotation makes consecutive element pairs approximately uniformly distributed on the unit circle. We extend this angular quantizer with per-layer early-boost, which independently configures K and V codebook sizes at each layer, allocating higher precision to a model-specific subset of critical layers. Across seven models (1B to 7B parameters), per-layer early-boost achieves lossless compression on four models and near-lossless quality on six of seven, at 3.28 to 3.67 angle bits per element. Asymmetric norm quantization (8-bit for keys, 4-bit log-space for values) yields 6.56 total bits per element on Mistral-7B with perplexity degradation of +0.0014 and no calibration data. A layer-group sensitivity analysis reveals model-specific bottleneck patterns, including K-dominated versus V-dominated layers and negative-transfer layers where increased precision degrades quality.
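
The fractional "angle bits per element" figures can be understood with a small accounting sketch. It assumes that one angle index of log2(n) bits covers a pair of elements, so a codebook of size n costs log2(n) / 2 bits per element, and that K and V hold the same number of elements in every layer. The layer count and codebook sizes below are hypothetical and are not taken from the paper.

```python
import math

def angle_bits_per_element(n_angles):
    """One angle index (log2(n_angles) bits) is shared by a pair of elements."""
    return math.log2(n_angles) / 2.0

def mean_angle_bits(layer_configs):
    """Average angle bits per element over all layers.

    layer_configs: list of (k_codebook_size, v_codebook_size), one per layer.
    Assumes K and V contribute equally many elements in each layer.
    """
    total = 0.0
    for k_size, v_size in layer_configs:
        total += angle_bits_per_element(k_size) + angle_bits_per_element(v_size)
    return total / (2 * len(layer_configs))

# Hypothetical 16-layer model: 4 boosted layers, 12 baseline layers.
boosted = [(256, 128)] * 4   # K: 4.0 bits/element, V: 3.5 bits/element
baseline = [(128, 64)] * 12  # K: 3.5 bits/element, V: 3.0 bits/element
print(mean_angle_bits(boosted + baseline))  # 3.375
```

Under these assumptions, boosting a few layers raises the average only slightly, which is consistent with the idea of spending extra precision on a small, model-specific subset of layers.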
