Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks
arXiv cs.LG / 3/13/2026
Key Points
- The paper proves that computing a trigger-conditional behavior in softmax self-attention necessarily induces an attention sink: because normalization on the probability simplex forces the attention weights to sum to one, the model must park its mass on a stable anchor to realize a default "do nothing" state.
- It uses a concrete task: when a designated trigger token appears, the model must output the average of all preceding token representations, and otherwise return zero, linking the sink behavior to a real-world attention pattern.
- The authors show that non-normalized ReLU attention can solve the same task without any sink, highlighting normalization as the fundamental driver of sink behavior.
- Experiments demonstrate that softmax models develop strong sinks in both single-head and multi-head variants, while ReLU attention eliminates them, and the findings extend beyond the theoretically analyzed setting.
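The normalization argument in the points above can be illustrated numerically. The sketch below is a hypothetical toy example (not the paper's construction): softmax weights always sum to one, so when a query should produce a near-zero output it must dump its mass on a sink position, whereas non-normalized ReLU attention can assign exactly zero weight everywhere.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax; output always sums to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy scores for one query over 4 content keys plus a sink position
# (e.g., a BOS token). All content scores are low: the model "wants"
# to attend to nothing.
scores = np.array([-5.0, -5.0, -5.0, -5.0, 5.0])  # last entry = sink

w_soft = softmax(scores)
# The simplex constraint forces the mass somewhere; it collapses
# onto the sink, so the weighted sum of content values is near zero.

# Non-normalized ReLU attention has no sum-to-one constraint:
# all-negative scores simply yield all-zero weights, and the
# zero "default" output needs no sink at all.
w_relu = np.maximum(scores[:4], 0.0)
```

This is the mechanism the paper's task isolates: softmax can only approximate the zero-output branch by routing mass to an anchor, while ReLU attention realizes it exactly.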