Gradient Boosting within a Single Attention Layer

arXiv cs.LG / 4/6/2026


Key Points

  • The paper proposes “gradient-boosted attention,” adding a second attention pass inside a single transformer attention layer that attends to the first pass’s prediction error and applies a learned gated correction.
  • Under a squared reconstruction objective, the authors show the method corresponds to Friedman's gradient boosting machine, treating each attention pass as a base learner and using a per-dimension gate as the shrinkage parameter.
  • The paper analyzes the dynamics of iterated attention updates (Hopfield-style retrieval and locally contracting maps), showing that query information orthogonal to the stored-pattern subspace can be erased and that distinct queries in the same region can collapse to the same fixed point, depending on the iteration regime.
  • Experiments on a 10M-token WikiText-103 subset report improved test perplexity (67.9) versus standard attention (72.2), with most gains achieved using two correction rounds.
  • The authors argue that using separate projection parameters for the correction pass can recover residual information that shared-projection variants (like Twicing Attention) may miss.
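The mechanism in the key points above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the exact form of the residual, the gate parameterization, and where the second pass draws its queries and keys from are assumptions here (the residual is taken against the input under a squared reconstruction objective, and `gamma` stands in for a learned per-dimension gate).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """One softmax-attention pass over a token matrix X (n_tokens x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))

# First pass: standard attention (base learner 1).
W1 = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
y0 = attention(X, *W1)

# Residual of the first pass: under a squared reconstruction objective
# this is the negative gradient of 0.5 * ||X - y||^2 at y = y0.
residual = X - y0

# Second pass with its OWN projections attends to the residual
# (base learner 2); gamma acts as a per-dimension shrinkage gate.
W2 = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
gamma = np.full(d, 0.1)  # stand-in for a learned gate
y1 = y0 + gamma * attention(residual, *W2)
```

Using separate projections `W2` for the correction pass, rather than reusing `W1` on the residual, is exactly the design choice the authors contrast with shared-projection variants such as Twicing Attention.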

Abstract

Transformer attention computes a single softmax-weighted average over values: a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of 67.9 compared to 72.2 for standard attention, 69.6 for Twicing Attention, and 69.0 for a parameter-matched wider baseline, with two rounds capturing most of the benefit.
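The boosting correspondence claimed in the abstract can be written out explicitly. Assuming a squared reconstruction loss against the input tokens $X$ (a plausible reading of the abstract; the paper's exact objective may differ), the negative functional gradient at the first-pass output is the residual, and the gate plays the role of Friedman's shrinkage parameter:

```latex
\hat{y}_0 = \mathrm{Attn}_1(X), \qquad
L(\hat{y}) = \tfrac{1}{2}\,\lVert X - \hat{y}\rVert^2,
\qquad
r = -\nabla_{\hat{y}}\, L(\hat{y}_0) = X - \hat{y}_0,
\qquad
\hat{y}_1 = \hat{y}_0 + \gamma \odot \mathrm{Attn}_2(r)
```

Here $\mathrm{Attn}_2$ carries its own projections $W_Q^{(2)}, W_K^{(2)}, W_V^{(2)}$, and $\gamma$ is the learned per-dimension gate; further correction rounds repeat the last two steps, which is the sense in which each attention pass is a base learner in a gradient boosting machine.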