Gradient Boosting within a Single Attention Layer

arXiv cs.LG / 4/6/2026


Key Points

  • The paper proposes “gradient-boosted attention,” adding a second attention pass inside a single transformer attention layer that attends to the first pass’s prediction error and applies a learned gated correction.
  • Under a squared reconstruction objective, the authors show the method corresponds to Friedman's gradient boosting machine, treating each attention pass as a base learner and using a per-dimension gate as the shrinkage parameter.
  • The paper analyzes the dynamics of iterated attention updates (Hopfield-style retrieval and locally contracting maps), showing that query information orthogonal to the stored-pattern subspace can be erased and that distinct queries in the same region can collapse to the same fixed point, depending on the iteration regime.
  • Experiments on a 10M-token WikiText-103 subset report improved test perplexity (67.9) versus standard attention (72.2), with most gains achieved using two correction rounds.
  • The authors argue that using separate projection parameters for the correction pass can recover residual information that shared-projection variants (like Twicing Attention) may miss.
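The mechanism in the key points above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the exact form of the residual, the gate parameterization, and where the second pass draws its queries and keys from are assumptions here (the residual is taken against the input under a squared reconstruction objective, and `gamma` stands in for a learned per-dimension gate).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """One softmax-attention pass over a token matrix X (n_tokens x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))

# First pass: standard attention (base learner 1).
W1 = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
y0 = attention(X, *W1)

# Residual of the first pass: under a squared reconstruction objective
# this is the negative gradient of 0.5 * ||X - y||^2 at y = y0.
residual = X - y0

# Second pass with its OWN projections attends to the residual
# (base learner 2); gamma acts as a per-dimension shrinkage gate.
W2 = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
gamma = np.full(d, 0.1)  # stand-in for a learned gate
y1 = y0 + gamma * attention(residual, *W2)
```

Using separate projections `W2` for the correction pass, rather than reusing `W1` on the residual, is exactly the design choice the authors contrast with shared-projection variants such as Twicing Attention.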

Abstract

Transformer attention computes a single softmax-weighted average over values: a one-pass estimate that cannot correct its own errors. We introduce gradient-boosted attention, which applies the principle of gradient boosting within a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of 67.9 compared to 72.2 for standard attention, 69.6 for Twicing Attention, and 69.0 for a parameter-matched wider baseline, with two rounds capturing most of the benefit.
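The boosting correspondence claimed in the abstract can be written out explicitly. Assuming a squared reconstruction loss against the input tokens $X$ (a plausible reading of the abstract; the paper's exact objective may differ), the negative functional gradient at the first-pass output is the residual, and the gate plays the role of Friedman's shrinkage parameter:

```latex
\hat{y}_0 = \mathrm{Attn}_1(X), \qquad
L(\hat{y}) = \tfrac{1}{2}\,\lVert X - \hat{y}\rVert^2,
\qquad
r = -\nabla_{\hat{y}}\, L(\hat{y}_0) = X - \hat{y}_0,
\qquad
\hat{y}_1 = \hat{y}_0 + \gamma \odot \mathrm{Attn}_2(r)
```

Here $\mathrm{Attn}_2$ carries its own projections $W_Q^{(2)}, W_K^{(2)}, W_V^{(2)}$, and $\gamma$ is the learned per-dimension gate; further correction rounds repeat the last two steps, which is the sense in which each attention pass is a base learner in a gradient boosting machine.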