Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

arXiv cs.CL / 3/26/2026


Key Points

  • The paper argues that current methods for increasing Transformer effective depth are rigid, statically allocating depth across parameters and layers throughout training, which creates training-time computational redundancy.
  • It introduces the Sparse Growing Transformer (SGT), a training-time sparse depth allocation approach that progressively extends recurrence from deeper to shallower layers using targeted attention looping on informative heads.
  • The method induces structural sparsity by increasing depth only for a small subset of parameters as training evolves, rather than applying additional computation uniformly.
  • Experiments on multiple parameter scales show SGT outperforms training-time static block-level looping baselines under comparable settings.
  • The approach substantially lowers training computational overhead, reducing additional training FLOPs from roughly 16–20% down to about 1–3% versus a standard Transformer backbone.

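The "informative heads" in the key points are identified by attention entropy. The paper's exact criterion is not given here, so the following is a minimal sketch under one plausible assumption: score each head by the mean Shannon entropy of its attention rows and pick the top-k as looping candidates. The function names (`head_entropy`, `select_loop_heads`) and the top-k rule are illustrative, not taken from the paper.

```python
import numpy as np

def head_entropy(attn, eps=1e-9):
    """Mean Shannon entropy of each head's attention distribution.

    attn: array of shape (num_heads, seq_len, seq_len), rows sum to 1.
    Returns one entropy score per head.
    """
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, seq_len)
    return ent.mean(axis=-1)

def select_loop_heads(attn, k):
    """Pick the k highest-entropy heads as candidates for attention looping.

    Hypothetical selection rule: high entropy is treated as a proxy for
    heads doing broad semantic integration, per the paper's motivation.
    """
    scores = head_entropy(attn)
    return np.argsort(scores)[::-1][:k]

# Toy example: 4 heads over a sequence of length 8.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
logits[0] *= 5.0  # sharpen head 0's logits so it becomes low-entropy
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(select_loop_heads(attn, k=2))  # head 0 should not be selected
```

Only the selected heads would receive the extra recurrent computation, which is what keeps the added FLOPs in the 1–3% range rather than the 16–20% cost of looping entire blocks.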
Abstract

Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16–20% to only 1–3% relative to a standard Transformer backbone.
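The abstract's "progressively growing structural process" can be made concrete with a small schedule sketch: recurrence begins at the deepest layer and extends toward shallower layers as training advances, mirroring the deep-to-shallow maturation trajectory. The linear pacing and the `max_looped` cap below are assumptions for illustration; the paper's actual growth rule may differ.

```python
def looped_layers(step, total_steps, num_layers, max_looped):
    """Deep-to-shallow growth schedule (illustrative, not the paper's exact rule).

    Returns the indices of layers whose selected heads are looped at this
    training step. Looping starts at the deepest layer and progressively
    extends toward shallower layers as training advances, up to max_looped
    layers in total.
    """
    frac = min(step / total_steps, 1.0)
    n = int(round(frac * max_looped))            # layers looping so far
    return list(range(num_layers - n, num_layers))  # the deepest n layers

# Example: a 12-layer model, at most 4 looped layers, 10k training steps.
print(looped_layers(0, 10_000, 12, 4))       # [] — no looping at the start
print(looped_layers(5_000, 10_000, 12, 4))   # [10, 11]
print(looped_layers(10_000, 10_000, 12, 4))  # [8, 9, 10, 11]
```

Because only a few layers (and within them, only the high-entropy heads) ever loop, depth grows as structural sparsity rather than as uniform block-level recurrence.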