TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
arXiv cs.CL / 4/7/2026
Key Points
- The paper addresses the KV cache memory bottleneck in long-form reasoning for LLMs by improving KV cache compression and key-importance estimation.
- It argues that common approaches relying on post-RoPE attention scores from recent queries fail because RoPE rotates Q/K with position, producing unstable and unrepresentative “top-key” selection (see the RoPE sketch after this list).
- TriAttention instead operates in the pre-RoPE space, leveraging an observed concentration of Q and K vectors around fixed non-zero centers that yields stable, distance-preferring attention behavior.
- Using a trigonometric series derived from the concentration centers (plus the Q/K norms as an auxiliary signal), TriAttention scores and retains keys more effectively for reasoning (a hedged scoring sketch also follows this list).
- Experiments on AIME25 (32K-token generation) show that TriAttention matches full-attention reasoning accuracy while improving throughput by 2.5x and/or reducing KV memory by 10.7x, enabling OpenClaw deployment on a single consumer GPU without running out of memory at long context lengths.
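
For the point about post-RoPE scores, a minimal numpy sketch (the RoPE layout, dimensions, and positions here are illustrative, not taken from the paper) of why a key's post-RoPE attention score drifts with the query's position while its pre-RoPE dot product stays fixed:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position embedding (RoPE) to vector x at position pos.
    Dimension pairs are rotated by position-dependent angles (split-half layout)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # rotation frequency per dimension pair
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
d = 64
q, k = rng.normal(size=d), rng.normal(size=d)

pre_score = q @ k  # pre-RoPE score: the same no matter where q and k appear

# Post-RoPE score for a fixed key at position 50 as the query position advances:
# the same key's apparent importance changes simply because generation has moved on.
for q_pos in (100, 1000, 8000):
    post_score = rope_rotate(q, q_pos) @ rope_rotate(k, 50)
    print(f"query pos {q_pos:>4}: pre-RoPE {pre_score:+.3f}   post-RoPE {post_score:+.3f}")
```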
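And for the scoring-and-keeping step, a hedged sketch of what cosine-based (trigonometric) key scoring plus top-k retention could look like. The `query_center` estimate, the `cos * norm` score, and the `keep_ratio` parameter are assumptions made for illustration, not the paper's actual TriAttention formulation:

```python
import numpy as np

def compress_kv(keys, values, query_center, keep_ratio=0.25):
    """Hypothetical sketch: score each pre-RoPE key by its cosine alignment with an
    estimated query concentration center, weighted by the key norm as an auxiliary
    signal, and keep only the top-scoring fraction of the KV cache.

    keys, values : (seq_len, head_dim) pre-RoPE key/value vectors
    query_center : (head_dim,) estimated center that query vectors cluster around
    keep_ratio   : fraction of cache entries to retain (illustrative parameter)
    """
    norms = np.linalg.norm(keys, axis=-1)
    cos = keys @ query_center / (norms * np.linalg.norm(query_center) + 1e-8)
    scores = cos * norms                              # trigonometric (cosine) term x norm
    n_keep = max(1, int(keep_ratio * len(keys)))
    kept = np.argsort(scores)[-n_keep:]               # indices of the highest-scoring keys
    kept.sort()                                       # preserve original sequence order
    return keys[kept], values[kept], kept

# Toy usage: keys drawn around a fixed non-zero center, echoing the paper's observation.
rng = np.random.default_rng(1)
d, seq_len = 64, 512
q_center = rng.normal(size=d)                         # stand-in for the observed Q center
keys = 0.5 * q_center + 0.1 * rng.normal(size=(seq_len, d))
values = rng.normal(size=(seq_len, d))
k_small, v_small, idx = compress_kv(keys, values, q_center, keep_ratio=0.25)
print(f"kept {len(idx)} of {seq_len} KV entries")
```

In an actual deployment such scoring would presumably run per attention head over the pre-RoPE cache, with the evicted entries accounting for the memory savings the paper reports.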