Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
arXiv stat.ML / 4/15/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper presents a rigorous random-matrix-theory analysis of the singular value spectrum of the self-attention matrix, in an asymptotic framework where the inverse temperature remains of constant order.
- It establishes a “Gaussian equivalence” result, showing that the attention matrix’s singular value distribution is asymptotically described by a tractable linear model.
- The authors find that the squared singular values do not follow the Marchenko–Pastur law, contradicting assumptions made in prior work.
- The proof combines precise control of normalization-term fluctuations with a refined linearization strategy that exploits favorable Taylor expansions of the exponential function.
- The work also derives a linearization threshold and explains why attention can still admit a Gaussian equivalence even though the softmax is not an entrywise operation.
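To get a feel for the object the paper studies, here is a minimal numerical sketch: it builds a softmax self-attention matrix from i.i.d. Gaussian tokens and query/key weights and computes its singular values. The sizes `n`, `d` and the `1/sqrt(d)` scalings are illustrative choices, not the paper's exact asymptotic regime, and the snippet makes no claim about which limiting law the spectrum follows.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 512, 64  # sequence length and head dimension (illustrative sizes)

# I.i.d. Gaussian token embeddings and query/key weights (toy stand-ins
# for the paper's random-input model, not its precise scaling)
X = rng.standard_normal((n, d))
WQ = rng.standard_normal((d, d)) / np.sqrt(d)
WK = rng.standard_normal((d, d)) / np.sqrt(d)

# Scaled score matrix and row-wise softmax: the self-attention matrix A
scores = (X @ WQ) @ (X @ WK).T / np.sqrt(d)
scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)  # each row of A now sums to 1

# Empirical singular values of A, sorted in descending order by np.linalg.svd
svals = np.linalg.svd(A, compute_uv=False)
print("top five singular values:", svals[:5])
```

The histogram of `svals**2` (after a suitable rescaling) is the empirical analogue of the squared-singular-value distribution whose limit the paper characterizes; the row normalization baked into `A` is exactly what makes attention a non-entrywise operation.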
Related Articles

RAG in Practice — Part 4: Chunking, Retrieval, and the Decisions That Break RAG
Dev.to
Why dynamically routing multi-timescale advantages in PPO causes policy collapse (and a simple decoupled fix) [R]
Reddit r/MachineLearning
How AI Interview Assistants Are Changing Job Preparation in 2026
Dev.to
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Dev.to

NEW PROMPT INJECTION
Dev.to