Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

arXiv cs.AI / 4/27/2026


Key Points

  • The study on arXiv finds that learned memory tokens are empirically necessary for a single-block Universal Transformer using Adaptive Computation Time (ACT) to achieve meaningful performance on Sudoku-Extreme.
  • Performance shows a sharp minimum memory-token threshold: T=0 always fails, T=4 is borderline, and T=8 reliably solves 81-cell puzzles, after which accuracy plateaus for T=8–32 before collapsing at T=64 due to attention dilution.
  • The authors identify an ACT router initialization “trap” where common bias initializations cause tokens to halt after ~2 steps and get stuck in a shallow equilibrium; using a negative bias ("deep start") prevents this failure mode.
  • With reliable training, ACT outperforms fixed-depth processing in consistency, can match accuracy with fewer ponder steps via lambda warmup, and shows attention head specialization by recursive depth into roles like memory readers and constraint propagators.
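The router initialization "trap" above follows directly from how ACT accumulates halting probability: with a constant per-step halting probability p = sigmoid(bias), the model halts once the cumulative sum reaches 1 − ε, i.e. after roughly (1 − ε)/p steps. A minimal sketch of that arithmetic (the function name and ε = 0.01 are illustrative assumptions, not the paper's code):

```python
import math

def expected_halt_step(bias, eps=0.01):
    """ACT halts when cumulative halting probability reaches 1 - eps.
    With a constant per-step probability p = sigmoid(bias), that takes
    roughly ceil((1 - eps) / p) steps."""
    p = 1.0 / (1.0 + math.exp(-bias))
    return p, math.ceil((1.0 - eps) / p)

# zero-bias init, Graves-style positive bias, and the paper's "deep start"
for bias in (0.0, 1.0, -3.0):
    p, steps = expected_halt_step(bias)
    print(f"bias={bias:+.1f}  p={p:.3f}  halts after ~{steps} steps")
```

This reproduces the failure mode described in the paper: both bias = 0 (p ≈ 0.5) and a positive bias (p ≈ 0.73) halt after ~2 steps at initialization, while bias = −3 (p ≈ 0.05) gives the model ~20 recursive steps to work with before its first halt.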

Abstract

We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested -- 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing -- no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes >70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 ("deep start," p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.
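The "scratchpad" role of the memory tokens can be pictured as T learned embedding rows prepended to the 81-cell puzzle sequence, so every ACT step attends over cells and memory jointly. A minimal shape-level sketch, assuming a standard prepend-and-concatenate design (dimensions and variable names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, T = 64, 81, 8  # 81 Sudoku cells, T memory tokens

# Learned memory-token embeddings, trained jointly with the model and
# initialized like any other embedding table.
memory_tokens = rng.normal(scale=0.02, size=(T, d_model))
cell_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for puzzle input

# At each recursive step the shared UT block processes cells + memory
# together, so the memory rows persist as a read/write scratchpad
# across ACT steps.
x = np.concatenate([memory_tokens, cell_embeddings], axis=0)
print(x.shape)  # (89, 64)
```

Under this framing, the T=64 collapse reported above is intuitive: 64 memory rows against 81 content rows dilutes attention over the actual puzzle cells.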