Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

arXiv stat.ML · April 30, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper shows that, under stochastic scaling, the token dynamics across layers in a finite transformer with MLP blocks converge (pathwise) to a continuous-time stochastic interacting particle system.
  • It derives the specific stochastic partial differential equation (SPDE) that governs how the token distribution evolves in the limiting model; a schematic form is sketched just after this list.
  • The authors prove propagation of chaos, establishing that as the number of tokens grows large, tokens behave increasingly independently while still following the same limiting law.
  • The study demonstrates “synchronization by noise”: when the common noise is strong enough relative to the deterministic self-attention drift, the limiting stochastic system exhibits exponential decay of the interaction energy on average (see the estimate sketched below).
  • It also characterizes the activation functions that satisfy the coercivity condition required for these noise-driven synchronization results.
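
To make the first two points concrete, here is the schematic shape such limits usually take. The symbols below (drift b, noise coefficient σ, conditional law μ_t) are generic notation for a McKean–Vlasov system with common noise, not the paper's actual coefficients, which depend on the attention and MLP parameters.

```latex
% Representative token in the mean-field limit (illustrative notation):
% a McKean--Vlasov SDE driven by a common Brownian motion W
\mathrm{d}X_t = b(X_t, \mu_t)\,\mathrm{d}t + \sigma(X_t)\,\mathrm{d}W_t,
\qquad \mu_t = \mathrm{Law}\!\left(X_t \mid W\right).

% The conditional law \mu_t then solves a stochastic Fokker--Planck SPDE:
\mathrm{d}\mu_t = \Big[ -\nabla\cdot\big(b(x,\mu_t)\,\mu_t\big)
  + \tfrac{1}{2}\,\nabla^2 \!:\! \big(\sigma\sigma^{\top}\mu_t\big) \Big]\,\mathrm{d}t
  - \nabla\cdot\big(\sigma(x)\,\mu_t\big)\,\mathrm{d}W_t.
```

In common-noise settings, propagation of chaos (the third point) is typically conditional: given the common Brownian motion W, any fixed collection of tokens becomes asymptotically independent as the number of tokens grows, each following the law μ_t.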

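For the synchronization-by-noise claim, the advertised estimate has the following schematic form; the constants c_σ and L_b are placeholders for a noise-coercivity constant and a drift-regularity constant, not quantities taken from the paper.

```latex
% Interaction energy of N tokens:
\mathcal{E}_t = \frac{1}{N^2} \sum_{i,j=1}^{N} \big| X_t^{i} - X_t^{j} \big|^2.

% Schematic dissipation estimate: exponential decay on average once the
% common-noise coercivity dominates the self-attention drift,
\mathbb{E}\big[\mathcal{E}_t\big] \le \mathcal{E}_0 \, e^{-\lambda t}
\quad \text{with } \lambda = c_\sigma - L_b > 0.
```

On the last point, one natural reading of coercivity for an activation φ is strong monotonicity, (φ(x) − φ(y))(x − y) ≥ c (x − y)² for some c > 0. Under that reading, leaky ReLU with negative slope α > 0 qualifies (with c = α), while plain ReLU (c = 0) and saturating activations such as tanh fail it globally; the paper's exact condition may of course differ.
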
Abstract

We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with multi-layer perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative, and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. We finally characterize the activation functions satisfying this coercivity condition.
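
As a minimal numerical sketch of the pre-limit dynamics, the following simulates an interacting particle system with a softmax-similarity drift and multiplicative common noise passed through an activation, tracking the interaction energy along the path. Everything here (the drift without learned query/key/value maps, the tanh noise coefficient, step size, and constants) is an illustrative assumption, not the paper's model.

```python
import numpy as np

def attention_drift(X, beta=1.0):
    """Softmax self-attention drift: pull each token toward an
    attention-weighted mean of all tokens. Illustrative stand-in for
    the paper's drift, which involves learned query/key/value maps."""
    logits = beta * (X @ X.T)                    # (n, n) similarity scores
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)
    return A @ X - X

def simulate(n=64, d=8, sigma=2.0, dt=1e-3, steps=2000, seed=0):
    """Euler--Maruyama discretization of the interacting particle system.
    One Brownian increment per step is shared by ALL tokens (common
    noise), entering multiplicatively through an assumed activation phi."""
    rng = np.random.default_rng(seed)
    phi = np.tanh                                # assumed noise activation
    X = rng.standard_normal((n, d))
    energy = np.empty(steps)
    for k in range(steps):
        dW = np.sqrt(dt) * rng.standard_normal(d)        # common increment
        X = X + attention_drift(X) * dt + sigma * phi(X) * dW
        diffs = X[:, None, :] - X[None, :, :]            # pairwise gaps
        energy[k] = (diffs ** 2).sum(axis=-1).mean()     # interaction energy
    return X, energy

if __name__ == "__main__":
    _, E = simulate()
    print(f"interaction energy: {E[0]:.4f} -> {E[-1]:.4f}")
```

Varying `sigma` against the drift strength `beta` is the numerical analogue of the paper's trade-off: synchronization should set in once the common-noise term dominates. Note that this toy attention drift is itself contractive, so isolating the purely noise-driven effect would require the paper's actual coefficients.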