Focus and Dilution: The Multi-stage Learning Process of Attention
arXiv cs.LG / 5/5/2026
Key Points
- The paper studies transformer training dynamics and finds a recurrent “focus–dilution” cycle in how attention learning evolves over time.
- It provides a rigorous explanation using gradient-flow analysis for a one-layer Transformer on Markovian data, decomposing one cycle into multiple distinct stages.
- Early in training, embeddings and projections quickly condense into a rank-one structure while attention parameters stay nearly frozen.
- As training progresses, the attention parameters begin to change, driving frequency-dependent focus toward high-frequency tokens; this focus later perturbs the embeddings and redistributes probability mass, diluting the focus it created.
- Small asymmetries among low-frequency tokens break degeneracies, open new embedding directions, and trigger subsequent focus–dilution cycles; the analysis is supported by experiments on synthetic Markov data and on WikiText and TinyStories.
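The early-training rank-one condensation described above can be checked empirically by tracking how much of a weight matrix's spectral energy sits in its top singular direction. The sketch below is a hypothetical diagnostic, not the paper's code: `rank_one_condensation` and the toy matrices are illustrative assumptions.

```python
import numpy as np

def rank_one_condensation(E: np.ndarray) -> float:
    """Fraction of spectral energy in the top singular direction.

    A value near 1.0 means the matrix is approximately rank one,
    the structure the paper reports for embeddings and projections
    early in training. (Illustrative metric, not the paper's code.)
    """
    s = np.linalg.svd(E, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

rng = np.random.default_rng(0)

# A noisy rank-one matrix scores near 1.0 ...
u, v = rng.normal(size=(64, 1)), rng.normal(size=(1, 32))
near_rank_one = u @ v + 0.01 * rng.normal(size=(64, 32))
print(rank_one_condensation(near_rank_one))

# ... while a generic random matrix spreads energy across directions.
print(rank_one_condensation(rng.normal(size=(64, 32))))
```

Logging this score for the embedding and projection matrices over training steps would make the condensation phase, and the later dilution phases, visible as a curve rising toward 1 and then dropping.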