Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

arXiv cs.AI / 5/7/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study argues that whether a Transformer learns low-complexity reasoning or high-complexity memorization is governed by complexity control mechanisms (initialization scale and weight decay), but prior analyses left open when during training this control actually becomes decisive.
  • The authors show that the memorization-versus-reasoning outcome is determined within a sharp, identifiable “critical window” during training, rather than being effectively static throughout.
  • Experiments on a controlled compositional task indicate that applying weight decay for only a 25% slice of training can match full-training weight decay on out-of-distribution (OOD) accuracy (0.93 vs 0.91), and that, with the same total regularization budget, placing regularization mid-training yields 5-9× higher OOD accuracy than placing it early.
  • The critical window’s timing is highly sensitive: shifting its onset by as little as ~100 optimization steps can move performance from chance-level to a reasoning regime, revealing an abrupt boundary.
  • The window’s location depends on initialization scale, and, importantly, the reasoning “basin of attraction” shrinks for small initialization, contradicting the common recommendation that smaller initialization is always better; the phenomenon is also not universal across tasks (e.g., modular arithmetic grokking does not show it).
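The windowed weight-decay schedule described in these findings can be sketched in a few lines. This is a minimal illustration, not the authors' code: the window fractions (0.25-0.50), the decay coefficient, and the function names are assumptions chosen to mirror the "single 25%-of-training slice placed mid-training" setup; the update rule is AdamW-style decoupled decay.

```python
def windowed_weight_decay(step, total_steps, lam,
                          window_start_frac=0.25, window_end_frac=0.50):
    """Return the weight-decay coefficient active at `step`.

    Decay is nonzero only inside a mid-training window covering 25% of
    training, so the total regularization "budget" matches a constant
    schedule of lam applied for one quarter of the run.
    (Window placement is illustrative; the paper reports that a
    mid-training placement beats an early one at equal budget.)
    """
    frac = step / total_steps
    return lam if window_start_frac <= frac < window_end_frac else 0.0

def train(total_steps=1000, lr=0.1, lam=0.01):
    """Toy loop on a scalar parameter with decoupled weight decay:
    w <- w - lr * grad - lr * wd * w  (AdamW-style decoupling)."""
    w = 1.0
    for step in range(total_steps):
        grad = 0.0  # placeholder; a real task supplies this gradient
        wd = windowed_weight_decay(step, total_steps, lam)
        w = w - lr * grad - lr * wd * w
    return w
```

Sliding `window_start_frac` in small increments would reproduce the paper's onset-sensitivity experiment, where a shift of ~100 steps flips the outcome between the memorization and reasoning regimes.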

Abstract

Recent work has shown that Transformers' compositional generalization is governed by *complexity control* (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open *when* during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i) weight decay applied for a single 25%-of-training window matches full-training weight decay in out-of-distribution (OOD) accuracy (0.93 vs 0.91); (ii) holding the total regularization budget constant, placing it in the middle of training yields 5-9× higher OOD accuracy than placing it early; (iii) the boundary of the critical window is remarkably sharp: shifting the window onset by as little as 100 optimization steps causes mean OOD accuracy to jump from chance (0.15) to the reasoning regime (0.61); (iv) the window's position depends systematically on initialization scale, but the basin of attraction for reasoning solutions *shrinks* at small initialization, contradicting the prevailing recommendation that smaller initialization is uniformly better. We further show that the critical-window phenomenon is task-specific: it does not appear in grokking on modular arithmetic, where properly tuned constant weight decay matches scheduled weight decay.