Cascade Token Selection for Transformer Attention Acceleration

arXiv cs.LG / 5/6/2026


Key Points

  • The paper proposes a cascade approach to speed up representative token selection in transformer attention by reusing the representative set across consecutive layers.
  • The proposed Activation Decorrelation Attention (ADA) method normally requires an expensive T×T Gram matrix at every layer; the cascade avoids this by validating and updating the inherited token set with cheaper (T−r)×r cross-Gram computations.
  • The computational cost of the token selection step drops from O(T^2 d) to O(T r d) per layer, while attention is still computed on the much smaller r×r compressed problem.
  • Experiments on GPT-2 124M, GPT-J 6B, and OPT 6.7B on AMD MI300X show 22%–63% savings in Gram operations and high Jaccard overlap (0.83–0.94) between the representative token sets of adjacent layers.
  • The results suggest that which tokens are informative is a structural property of the input that propagates coherently through network depth, so consecutive layers tend to rely on the same non-redundant tokens.
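The validation-and-update step described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' code: the threshold `tau` and the rule for promoting uncovered tokens to representatives are assumptions, and the paper's removal step is omitted for brevity. The key point is that the only matrix product is the (T−r)×r cross-Gram, costing O(T r d) rather than O(T^2 d).

```python
import numpy as np

def cascade_update(X, rep_idx, tau=0.9):
    """One cascade validation step (illustrative sketch).

    X:       (T, d) token activations at the current layer.
    rep_idx: representative indices inherited from the previous layer.
    tau:     assumed correlation threshold; a token whose best normalized
             cross-Gram entry falls below tau is promoted to representative.
    """
    T = X.shape[0]
    # Unit-normalize rows so Gram entries are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    rep_mask = np.zeros(T, dtype=bool)
    rep_mask[rep_idx] = True
    rest = np.where(~rep_mask)[0]
    # (T - r) x r cross-Gram: similarity of each remaining token to every
    # inherited representative -- the cheap O(T r d) computation.
    cross = Xn[rest] @ Xn[rep_idx].T
    covered = cross.max(axis=1) >= tau
    # Tokens not explained by the inherited set become new representatives.
    additions = rest[~covered]
    return np.sort(np.concatenate([np.asarray(rep_idx), additions]))
```

With high layer-to-layer Jaccard overlap, `additions` is typically small, so the inherited set passes through with only minor corrections.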

Abstract

A method is presented for reducing the cost of representative token selection in transformer attention layers by exploiting the coherence of the representative set across depth. Activation Decorrelation Attention (ADA) selects r ≪ T representative tokens at each layer via a Gram threshold and computes attention on the compressed r × r problem, but the selection requires a T × T Gram matrix at every layer. The cascade mechanism introduced here inherits the representative set from layer l to layer l+1, validates it via a (T − r) × r cross-Gram computation, and updates it with a small number of additions and removals. The cost of the selection step drops from O(T^2 d) to O(T r d) per layer. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates Gram-operation savings of 22% to 63% with mean Jaccard overlap of 0.83 to 0.94 between consecutive layers. The cascade reveals that the set of informative tokens is a structural property of the input that propagates coherently through the depth of the network: the same tokens carry the non-redundant information at layer l and at layer l+1.
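As a back-of-envelope check of the complexity claim, the multiply-add counts of the two selection variants can be compared directly. The sizes below are hypothetical, and the count ignores the update step and any periodic full recomputation, which is one reason the measured savings (22%–63%) are lower than this idealized figure.

```python
# Hypothetical sizes: sequence length T, representatives r, model width d.
T, r, d = 2048, 256, 768

full_gram = T * T * d          # O(T^2 d): per-layer T x T Gram matrix
cross_gram = (T - r) * r * d   # O(T r d): cascade (T - r) x r cross-Gram

# Idealized speedup of the selection step (upper bound on savings).
print(full_gram / cross_gram)  # → 64/7 ≈ 9.14x fewer multiply-adds
```

The idealized ratio scales roughly as T/r, so the cascade helps most at long sequence lengths with aggressive compression.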