Focus and Dilution: The Multi-stage Learning Process of Attention

arXiv cs.LG / 5/5/2026


Key Points

  • The paper studies transformer training dynamics and identifies a recurrent “focus–dilution” cycle in how attention evolves during training.
  • It explains the cycle rigorously via gradient-flow analysis of a one-layer Transformer trained on Markovian data, decomposing each cycle into a sequence of distinct stages.
  • Early in training, embeddings and projections quickly condense into a rank-one structure while attention parameters stay nearly frozen.
  • As training progresses, the attention parameters begin to grow, driving a frequency-dependent focus toward high-frequency tokens; this in turn perturbs the embeddings and sets off a mass redistribution that dilutes the focus.
  • Small asymmetries among low-frequency tokens break the remaining degeneracy, open new embedding directions, and trigger the next focus–dilution cycle; experiments on synthetic Markov data, WikiText, and TinyStories support the predicted stages (a toy version of the setup is sketched after this list).
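
To make the key points concrete, here is a minimal, self-contained sketch of the setting the paper analyzes: a one-layer, single-head attention model trained on sequences from a small Markov chain. This is not the authors' code; the architecture, hyperparameters, and probes are illustrative assumptions. It logs two quantities the cycle predicts: the effective rank of the embedding matrix (the early condensation stage drives it toward one) and the entropy of the attention weights (focus appears as falling entropy, dilution as its later rise).

```python
# Toy reproduction of the paper's setting: a one-layer attention model
# trained on Markov-chain data. Not the authors' code; the architecture
# and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
V, D, T = 8, 16, 32  # vocab size, embedding dim, sequence length (assumed)

# Row-stochastic transition matrix; cubing skews the stationary distribution
# so some tokens are high-frequency and others low-frequency.
P = torch.rand(V, V) ** 3
P /= P.sum(dim=1, keepdim=True)

def sample_batch(b):
    """Sample b token sequences of length T from the Markov chain P."""
    seq = [torch.randint(0, V, (b, 1))]
    for _ in range(T - 1):
        seq.append(torch.multinomial(P[seq[-1].squeeze(1)], 1))
    return torch.cat(seq, dim=1)  # (b, T)

class OneLayerAttention(nn.Module):
    """Single-head, one-layer causal attention for next-token prediction."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.wq = nn.Linear(D, D, bias=False)
        self.wk = nn.Linear(D, D, bias=False)
        self.wv = nn.Linear(D, D, bias=False)
        self.out = nn.Linear(D, V, bias=False)

    def forward(self, x):
        h = self.emb(x)                                  # (b, t, D)
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        s = q @ k.transpose(-2, -1) / D ** 0.5           # (b, t, t)
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        attn = s.masked_fill(causal, float("-inf")).softmax(dim=-1)
        return self.out(attn @ v), attn

def effective_rank(W):
    """exp(entropy of normalized singular values); ~1 when W is rank-one."""
    p = torch.linalg.svdvals(W.detach())
    p = p / p.sum()
    return torch.exp(-(p * (p + 1e-12).log()).sum()).item()

model = OneLayerAttention()
opt = torch.optim.SGD(model.parameters(), lr=0.2)  # plain GD as a crude gradient-flow proxy
loss_fn = nn.CrossEntropyLoss()

for step in range(2001):
    x = sample_batch(64)
    logits, attn = model(x[:, :-1])                      # predict the next token
    loss = loss_fn(logits.reshape(-1, V), x[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        # Focus = falling attention entropy; dilution = its later rise.
        p = attn[:, -1, :].detach()
        ent = -(p * (p + 1e-9).log()).sum(-1).mean().item()
        print(f"step {step:5d}  loss {loss.item():.3f}  "
              f"attn entropy {ent:.3f}  emb eff-rank {effective_rank(model.emb.weight):.2f}")
```

Plain SGD here stands in for the gradient flow the theory works with; running a few seeds over a longer horizon makes the non-monotone entropy trace, if present, easier to spot.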

Abstract

Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.
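
For readers unfamiliar with the machinery the abstract invokes, the two standard objects are the gradient flow of the loss and its linearization around a critical point. The notation below is generic and is our assumption, not necessarily the paper's:

```latex
% Gradient flow on the loss L over all parameters theta, and its
% stage-wise linearization near a critical point theta*.
\begin{align}
  \dot{\theta}(t) &= -\nabla L\big(\theta(t)\big)
    && \text{(gradient flow)} \\
  \frac{d}{dt}\,\delta\theta(t) &\approx -\nabla^2 L(\theta^\star)\,\delta\theta(t),
    && \theta = \theta^\star + \delta\theta
\end{align}
```

In this picture, parameters that are “effectively frozen” sit along directions where the Hessian has near-zero eigenvalues, and a degenerate critical point has exactly such flat directions; the small asymmetries among low-frequency tokens give those eigenvalues a definite sign, selecting the escape direction that opens the next cycle.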