Ordinary Least Squares is a Special Case of Transformer

arXiv cs.LG / 4/16/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper recasts the Transformer's essence not as a "universal approximator" but as a "neural version of known computational algorithms," proving algebraically that OLS is a special case of the single-layer linear Transformer.
  • Using the spectral decomposition of the empirical covariance matrix, the authors construct a concrete parameter setting under which the attention mechanism's forward pass is mathematically equivalent to the OLS closed-form (projection) solution.

Abstract

The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes the Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting under which the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in a single forward pass rather than by iteration. Building upon this prototypical case, we further uncover a decoupled slow- and fast-memory mechanism within Transformers. Finally, we discuss the evolution from our established linear prototype to standard Transformers. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.
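The equivalence the abstract describes can be sketched numerically. The construction below is an illustrative reading, not the paper's exact parameterization: for in-context regression tokens (x_i, y_i), a single softmax-free (linear) attention layer whose query weights precondition by the inverse empirical covariance C⁻¹ = (XᵀX)⁻¹ reproduces the OLS prediction in one forward pass. The variable names and the specific weight choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))      # in-context example inputs
w_true = rng.normal(size=d)
y = X @ w_true                   # noiseless targets
x_q = rng.normal(size=d)         # query input

# OLS closed form via the empirical covariance C = X^T X
C = X.T @ X
w_ols = np.linalg.solve(C, X.T @ y)
y_ols = x_q @ w_ols

# Single-layer linear attention (no softmax), hypothetical setting:
# the query is preconditioned by C^{-1} (a data-dependent choice,
# standing in for the paper's spectral-decomposition construction),
# keys read the x-part of each token, values read the y-part.
# The forward pass  sum_i (q . k_i) v_i  then equals
# x_q^T C^{-1} X^T y, i.e. the OLS prediction, in one pass.
q = np.linalg.inv(C) @ x_q       # preconditioned query
scores = X @ q                   # q . k_i for each example token
y_attn = scores @ y              # weighted sum of values

print(y_attn, y_ols)
assert np.allclose(y_attn, y_ols)
```

Because the data are noiseless and n > d, OLS recovers `w_true` exactly here, so the attention output also matches `x_q @ w_true`; the point is that no gradient iteration is needed once the weights encode C⁻¹.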