Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

arXiv cs.AI / 5/7/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study argues that whether a Transformer learns low-complexity reasoning or high-complexity memorization is governed by complexity control mechanisms (initialization scale and weight decay), but prior analyses left open when during training this control actually becomes decisive.
  • The authors show that the memorization-versus-reasoning outcome is determined within a sharp, identifiable “critical window” during training, rather than being effectively static throughout.
  • Experiments on a controlled compositional task indicate that applying weight decay for only a 25% slice of training can match full-training weight decay on out-of-distribution (OOD) accuracy (0.93 vs 0.91), and that, with the same total regularization budget, placing regularization mid-training yields 5-9× higher OOD accuracy than placing it early.
  • The critical window’s timing is highly sensitive: shifting its onset by as little as ~100 optimization steps can move performance from chance-level to a reasoning regime, revealing an abrupt boundary.
  • The window’s location depends on initialization scale, and, importantly, the reasoning “basin of attraction” shrinks for small initialization, contradicting the common recommendation that smaller initialization is always better; the phenomenon is also not universal across tasks (e.g., modular arithmetic grokking does not show it).
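The windowed weight-decay schedule described in these findings can be sketched in a few lines. This is a minimal illustration, not the authors' code: the window fractions (0.25-0.50), the decay coefficient, and the function names are assumptions chosen to mirror the "single 25%-of-training slice placed mid-training" setup; the update rule is AdamW-style decoupled decay.

```python
def windowed_weight_decay(step, total_steps, lam,
                          window_start_frac=0.25, window_end_frac=0.50):
    """Return the weight-decay coefficient active at `step`.

    Decay is nonzero only inside a mid-training window covering 25% of
    training, so the total regularization "budget" matches a constant
    schedule of lam applied for one quarter of the run.
    (Window placement is illustrative; the paper reports that a
    mid-training placement beats an early one at equal budget.)
    """
    frac = step / total_steps
    return lam if window_start_frac <= frac < window_end_frac else 0.0

def train(total_steps=1000, lr=0.1, lam=0.01):
    """Toy loop on a scalar parameter with decoupled weight decay:
    w <- w - lr * grad - lr * wd * w  (AdamW-style decoupling)."""
    w = 1.0
    for step in range(total_steps):
        grad = 0.0  # placeholder; a real task supplies this gradient
        wd = windowed_weight_decay(step, total_steps, lam)
        w = w - lr * grad - lr * wd * w
    return w
```

Sliding `window_start_frac` in small increments would reproduce the paper's onset-sensitivity experiment, where a shift of ~100 steps flips the outcome between the memorization and reasoning regimes.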

Abstract

Recent work has shown that Transformers' compositional generalization is governed by *complexity control* (initialization scale and weight decay), which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open *when* during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i) weight decay applied for a single 25%-of-training window matches full-training weight decay in out-of-distribution (OOD) accuracy (0.93 vs 0.91); (ii) holding the total regularization budget constant, placing it in the middle of training yields 5-9× higher OOD accuracy than placing it early; (iii) the boundary of the critical window is remarkably sharp: shifting the window onset by as little as 100 optimization steps causes mean OOD accuracy to jump from chance (0.15) to the reasoning regime (0.61); (iv) the window's position depends systematically on initialization scale, but the basin of attraction for reasoning solutions *shrinks* at small initialization, contradicting the prevailing recommendation that smaller initialization is uniformly better. We further show that the critical-window phenomenon is task-specific: it does not appear in grokking on modular arithmetic, where properly tuned constant weight decay matches scheduled weight decay.