Kwai Summary Attention Technical Report

arXiv cs.CL / 4/28/2026

Key Points

  • The paper argues that long-context needs are central to next-generation LLMs, but standard softmax attention becomes too expensive because its runtime grows quadratically with sequence length.
  • Existing approaches either shrink the per-layer KV cache (which still scales linearly with sequence length) or adopt KV-cache-friendly attention architectures, often at the cost of long-context modeling effectiveness.
  • The authors propose an intermediate direction: keep the KV cache scaling linearly with sequence length while applying semantic-level compression at a chosen ratio k, aiming for O(n/k) cost without pursuing a “minimum KV cache” (see the cost sketch after this list).
  • They introduce Kwai Summary Attention (KSA), which compresses historical context into learnable summary tokens to lower long-sequence training and inference costs while retaining interpretable long-distance dependencies.
  • The work positions KSA as a new attention mechanism designed to balance memory/computation savings with strong long-range semantic understanding for tasks like reasoning, code agentic intelligence, and recommendations.

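To make the O(n/k) framing concrete, here is a back-of-the-envelope Python snippet. It is our own illustration, not from the paper, and the helper `attention_pairs` is a hypothetical name; it simply counts query-key interactions for a long sequence with and without a compression ratio k:

```python
# Back-of-the-envelope cost comparison (illustrative only; the exact
# constants depend on the model and are not given in the report).
# Standard softmax attention computes n * n query-key dot products per
# head; compressing the attended history at ratio k means each query
# attends to roughly n/k tokens instead of n.

def attention_pairs(n: int, k: int = 1) -> int:
    """Number of query-key interactions for sequence length n
    when the attended history is compressed by a factor of k."""
    return n * (n // k)

n = 128_000  # example long-context length
for k in (1, 4, 16):
    print(f"k={k:>2}: {attention_pairs(n, k):,} pairs")
# k= 1: 16,384,000,000 pairs  (standard quadratic attention)
# k= 4:  4,096,000,000 pairs
# k=16:  1,024,000,000 pairs
```

This is the point of the middle path: the KV cache, at n/k entries, still grows linearly with the sequence length, while the attention cost drops from n² toward n²/k.
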
Abstract

Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly for semantic understanding/reasoning, code agentic intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length, so as sequences grow the overhead in long-context settings becomes substantial, and the training and inference costs of extremely long sequences deteriorate rapidly. Existing solutions mitigate this issue through two technical routes: i) reducing the KV cache per layer, e.g., head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache remains linearly dependent on the sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, e.g., local attention (SWA) or linear kernels (GDN), which often trade KV cache size against long-context modeling effectiveness. Beyond these two routes, we argue that there exists an intermediate, underexplored path: maintaining a linear relationship between the KV cache and the sequence length while performing semantic-level compression at a specific ratio k. This O(n/k) path does not pursue a “minimum KV cache”; rather, it trades acceptable memory costs for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.

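The abstract does not spell out KSA's exact formulation, so the following is only a minimal single-head sketch of the general summary-token idea, assuming mean pooling followed by a learnable projection as the compressor and omitting causal masking and any local attention window. The class name `SummaryAttention` and all hyperparameters below are our own illustrative choices, not the paper's:

```python
# Minimal sketch of summary-token attention in plain PyTorch.
# Assumption: a pooled-and-projected compressor; the report's actual
# KSA design may learn summary tokens differently.
import torch
import torch.nn as nn

class SummaryAttention(nn.Module):
    """Single-head attention that compresses every block of k tokens
    into one summary token before attending."""

    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_proj = nn.Linear(d_model, 2 * d_model)
        # Learnable map from a pooled block of k tokens to one summary.
        self.summarize = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model); assume n divisible by k for brevity.
        b, n, d = x.shape
        # Compress the context: pool each block of k tokens, project.
        blocks = x.view(b, n // self.k, self.k, d).mean(dim=2)
        summaries = self.summarize(blocks)           # (b, n/k, d)
        q = self.q_proj(x)                           # (b, n, d)
        k_, v = self.kv_proj(summaries).chunk(2, dim=-1)
        # Score matrix is n x n/k instead of n x n.
        attn = torch.softmax(q @ k_.transpose(1, 2) / d**0.5, dim=-1)
        return attn @ v                              # (b, n, d)

x = torch.randn(2, 64, 32)
out = SummaryAttention(d_model=32, k=4)(x)
print(out.shape)  # torch.Size([2, 64, 32])
```

Under this shape of design, each query attends to n/k summary keys, so the score matrix is n × n/k rather than n × n, and the stored keys/values scale as n/k, which remains linear in the sequence length as the paper's intermediate path requires.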