Transactional Attention: Semantic Sponsorship for KV-Cache Retention

arXiv cs.CL · April 14, 2026


Key Points

  • Existing KV-cache compression methods fail to retain sensitive credential tokens at small budgets: at K=16 (about 0.4% of a 4K context), approaches based on attention scores, reconstruction loss, or retention gating all yield 0% credential retrieval.
  • The paper identifies a key failure mode: “dormant tokens” (e.g., credentials, API keys, config values) that receive near-zero attention during encoding but are required later during generation.
  • It proposes Transactional Attention (TA), a semantic sponsorship mechanism that uses structural anchor patterns (such as "key:" or "password:") to protect adjacent value-bearing tokens from eviction.
  • TA achieves 100% credential retrieval at K=16 and maintains 100% accuracy across 200 function-calling trials, outperforming six named KV-cache compression baselines that score 0%.
  • TA-Fast, an attention-free variant, cuts memory overhead by 52%, is compatible with SDPA/FlashAttention, and adds under 1% latency overhead while being orthogonal to existing compression techniques.
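The sponsorship idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `select_retained` helper, the anchor regex, and the `window` parameter are all illustrative assumptions. The point is only that value tokens following a structural anchor are forced into the retained set, so a top-K eviction policy cannot drop them even when their attention scores are near zero.

```python
import re

# Toy anchor pattern (assumed, not from the paper): a prefix ending in
# "key:", "password:", "token:", or "secret:" sponsors the tokens after it.
ANCHOR = re.compile(r"\b(?:key|password|token|secret)\s*:\s*$", re.IGNORECASE)

def select_retained(tokens, attn_scores, k, window=1):
    """Return sorted indices of the ~K tokens kept in the KV cache.

    tokens: decoded token strings; attn_scores: accumulated attention per
    token (an H2O-style proxy). Sponsored tokens are kept unconditionally;
    the remaining budget is filled by attention score.
    """
    sponsored = set()
    for i in range(len(tokens)):
        # Match against the decoded text so far, so anchors split across
        # several tokens (e.g. [" password", ":"]) still trigger.
        prefix = "".join(tokens[: i + 1])
        if ANCHOR.search(prefix):
            sponsored.update(range(i + 1, min(i + 1 + window, len(tokens))))
    # Fill the rest of the budget with the highest-attention tokens.
    remaining = [i for i in range(len(tokens)) if i not in sponsored]
    remaining.sort(key=lambda i: attn_scores[i], reverse=True)
    budget = max(k - len(sponsored), 0)
    return sorted(sponsored | set(remaining[:budget]))
```

Under plain attention-score eviction, a dormant value token (lowest score in the cache) is evicted first; with sponsorship it survives because the anchor, not its own attention, vouches for it.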

Abstract

At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., "key:", "password:") protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.
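Two of the abstract's claims, attention-freeness and orthogonality, can be sketched together. Because the keep decision below is computed from token text alone, it never needs materialized attention weights (which fused SDPA/FlashAttention kernels do not expose), and it composes with an existing policy by a simple OR of keep-masks. All names here (`sponsorship_mask`, `streaming_keep`, the anchor list, the whole-token matching) are illustrative assumptions, not the paper's API; real tokenizers would split anchors across subwords.

```python
# Assumed toy anchor strings; a real system would match at the subword level.
ANCHORS = ("key:", "password:", "token:", "secret:")

def sponsorship_mask(tokens, span=1):
    """Attention-free keep-mask: anchors sponsor the next `span` tokens."""
    keep = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.strip().lower() in ANCHORS:
            for j in range(i, min(i + 1 + span, len(tokens))):
                keep[j] = True
    return keep

def streaming_keep(n, sinks=4, recent=8):
    """StreamingLLM-style mask: attention sinks plus a recent window."""
    return [i < sinks or i >= n - recent for i in range(n)]

def combined_keep(tokens, sinks=4, recent=8, span=1):
    """Orthogonal composition: keep a token if either policy keeps it."""
    base = streaming_keep(len(tokens), sinks, recent)
    spons = sponsorship_mask(tokens, span)
    return [b or s for b, s in zip(base, spons)]
```

A mid-context credential falls outside both the sinks and the recent window, so the streaming mask alone drops it; the OR with the sponsorship mask rescues exactly those tokens while leaving the base policy untouched.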