The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

arXiv cs.LG / 4/24/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper argues that standard Transformers are temporally shallow because each layer attends to key/value representations computed only from the previous layer, limiting effective depth to the number of layers.
  • It proposes the “Recurrent Transformer,” where each layer attends to key/value pairs computed from its own activations, adding layerwise recurrent memory while keeping standard autoregressive decoding cost (see the sketch after this list).
  • The authors show the architecture can emulate both a conventional Transformer and token-to-token recurrent updates, and they claim it avoids the optimization instability often seen in recurrent models.
  • They introduce a tiling-based algorithm that preserves the same math but reduces HBM traffic during prefill/training from Θ(N²) to Θ(N log N), increasing effective arithmetic intensity to Θ(N/log N).
  • Experiments pretraining 150M- and 300M-parameter models on C4 indicate improved cross-entropy over a parameter-matched Transformer baseline, with the gains achieved using fewer layers, potentially reducing KV cache memory footprint and inference latency.
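
To make the second point concrete, here is a minimal NumPy sketch of single-step decoding with a per-layer recurrent KV cache. It is a toy illustration under simplifying assumptions (one attention head; no residuals, MLPs, or normalization; the projection names Wq/Wk/Wv and all shapes are illustrative), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers, n_tokens = 16, 4, 8

# Hypothetical per-layer projection matrices (illustrative, not from the paper).
Wq = [rng.normal(size=(d, d)) / d**0.5 for _ in range(n_layers)]
Wk = [rng.normal(size=(d, d)) / d**0.5 for _ in range(n_layers)]
Wv = [rng.normal(size=(d, d)) / d**0.5 for _ in range(n_layers)]

# One KV cache per layer, filled from that layer's OWN past outputs.
# (A standard Transformer would instead compute K/V from the layer below.)
k_cache = [[] for _ in range(n_layers)]
v_cache = [[] for _ in range(n_layers)]

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

for t in range(n_tokens):
    x = rng.normal(size=d)              # stand-in for the embedded input token
    for l in range(n_layers):
        q = Wq[l] @ x
        if k_cache[l]:                  # attend over this layer's own history
            K = np.stack(k_cache[l])    # shape (t, d)
            V = np.stack(v_cache[l])
            h = softmax(K @ q / d**0.5) @ V
        else:
            h = x                       # no recurrent memory yet at the first step
        # The new key/value are projections of this layer's own activation h,
        # so the output at step t becomes memory for step t+1 (layerwise recurrence).
        k_cache[l].append(Wk[l] @ h)
        v_cache[l].append(Wv[l] @ h)
        x = h                           # feed the next layer, as in a standard Transformer
```

Each layer still appends one key/value pair per token, so per-token decoding cost and cache size match a standard Transformer; only the source of the keys and values changes. That same change is what makes prefill sequential and motivates the tiling algorithm described in the abstract below.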

Abstract

Transformers process tokens in parallel but are temporally shallow: at position t, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and historically underutilize modern accelerators. We introduce the Recurrent Transformer, a simple architectural change where each layer attends to key-value pairs computed off its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with effective arithmetic intensity near 1 because keys and values are revealed sequentially; we give an exact tiling-based algorithm that preserves the mathematical computation while reducing HBM traffic from \Theta(N^2) to \Theta(N\log N), increasing effective arithmetic intensity to \Theta(N/\log N) for sequence length N. On 150M and 300M parameter C4 pretraining, Recurrent Transformers improve cross-entropy over a parameter-matched Transformer baseline and achieve the improvement with fewer layers (fixed parameters), suggesting that recurrence can trade depth for width, thus reducing KV cache memory footprint and inference latency.
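
As a rough, back-of-the-envelope reading of the complexity claims above (our accounting, not a derivation from the paper): because keys and values are revealed one token at a time, naive prefill re-reads the growing per-layer cache from HBM at every step, so bytes moved scale like the Θ(N²) attention FLOPs; the tiling algorithm keeps the same FLOPs but moves only Θ(N log N) bytes, meaning each key/value pair is reloaded roughly O(log N) times, giving

\[
\text{intensity}_{\text{naive}} = \frac{\Theta(N^2)\ \text{FLOPs}}{\Theta(N^2)\ \text{bytes}} = \Theta(1),
\qquad
\text{intensity}_{\text{tiled}} = \frac{\Theta(N^2)\ \text{FLOPs}}{\Theta(N\log N)\ \text{bytes}} = \Theta\!\left(\frac{N}{\log N}\right).
\]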