Dimensional Criticality at Grokking Across MLPs and Transformers

arXiv cs.LG / 4/21/2026

💬 Opinion / Models & Research


Key Points

The paper proposes TDU–OFC, an offline “avalanche probe” that turns gradient snapshots into cascade statistics and produces a macroscopic observable, the time-resolved effective cascade dimension D(t), to study grokking transitions.
Across both modular-addition Transformers and XOR MLPs, D(t) shows a localized crossing of the Gaussian diffusion baseline D=1 precisely at the generalization transition.
The crossing direction depends on the task: modular addition approaches the transition from D>1 and descends through D=1, while XOR approaches from D<1 and ascends through D=1.
Multiple controls support a genuine dynamical critical-manifold interpretation: ungrokked runs stay supercritical (D>1), shadow-probe settings (α_train=0) indicate D(t) is non-invasive, and grokked and ungrokked trajectories begin to diverge 100–200 epochs before the behavioral transition.
The authors also find heavy-tailed avalanche distributions and finite-size scaling that align with a dimensional exponent inferred from D(t), strengthening the macroscopic criticality claim.
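
The crossing behavior described in the points above can be illustrated with a small sketch. This is not the authors' code: the function name `find_crossings` and the three short series are made up for illustration, and the only thing taken from the summary is the baseline D=1 and the two crossing directions.

```python
import numpy as np

def find_crossings(d_series, baseline=1.0):
    """Return (index, direction) pairs where d_series crosses the baseline.

    Hypothetical helper: the index is the first sample on the new side.
    Samples exactly equal to the baseline are not counted as a crossing.
    """
    d = np.asarray(d_series, dtype=float)
    side = np.sign(d - baseline)
    crossings = []
    for i in range(1, len(d)):
        if side[i - 1] > 0 and side[i] < 0:
            crossings.append((i, "descending"))
        elif side[i - 1] < 0 and side[i] > 0:
            crossings.append((i, "ascending"))
    return crossings

# Made-up series, chosen only to mimic the three qualitative cases.
modadd_like = [1.40, 1.25, 1.08, 0.96, 0.90]     # starts at D>1, descends
xor_like = [0.60, 0.78, 0.95, 1.07, 1.10]        # starts at D<1, ascends
ungrokked_like = [1.50, 1.42, 1.38, 1.35, 1.33]  # stays at D>1

print(find_crossings(modadd_like))     # [(3, 'descending')]
print(find_crossings(xor_like))        # [(3, 'ascending')]
print(find_crossings(ungrokked_like))  # []
```

A real D(t) estimate would be noisy, so some smoothing or a tolerance band around the baseline would probably be needed before counting crossings.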

Abstract

Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example (an abrupt transition from memorization to generalization long after training accuracy saturates), yet robust macroscopic signatures of this transition remain elusive. Here we introduce TDU–OFC (Thresholded Diffusion Update–Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a macroscopic observable, the time-resolved effective cascade dimension D(t), via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline D=1 precisely at the generalization transition. The crossing direction is task-dependent: modular addition descends through D=1 (approaching from D>1), while XOR ascends (from D<1). This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near D≈1. Negative controls confirm this picture: ungrokked runs remain supercritical (D>1) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from D(t). Shadow-probe controls (α_train=0) confirm that D(t) is non-invasive, and grokked trajectories diverge from ungrokked ones in D(t) some 100–200 epochs before the behavioral transition.
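
The abstract does not spell out how TDU–OFC turns a gradient snapshot into cascades, so the following is only a rough sketch of a generic Olami-Feder-Christensen-style relaxation, not the authors' probe. Everything specific here is an assumption: the function name `ofc_avalanches`, the 1D ring geometry, the use of normalized gradient magnitudes as initial "stress", the parameter values, and the random vector standing in for a real gradient snapshot.

```python
import numpy as np

def ofc_avalanches(stress, n_events=200, threshold=1.0, alpha=0.2):
    """Run OFC-style avalanches on a 1D ring and return their sizes.

    Hypothetical sketch. Each event drives all sites uniformly until the
    most loaded site reaches the threshold, then relaxes: a site at or
    above the threshold resets to 0 and passes alpha * load to each of
    its two neighbors. With alpha < 0.5 every toppling dissipates
    stress, so each avalanche terminates.
    """
    s = np.array(stress, dtype=float)  # work on a copy
    n = s.size
    sizes = []
    for _ in range(n_events):
        k = s.argmax()
        s += threshold - s[k]  # uniform drive
        s[k] = threshold       # guard against floating-point rounding
        size = 0
        active = np.flatnonzero(s >= threshold)
        while active.size:
            for i in active:
                load = s[i]
                s[i] = 0.0
                s[(i - 1) % n] += alpha * load
                s[(i + 1) % n] += alpha * load
                size += 1
            active = np.flatnonzero(s >= threshold)
        sizes.append(size)  # avalanche size = number of topplings
    return np.array(sizes), s

# Stand-in for a gradient snapshot (assumption: random values, not real data).
rng = np.random.default_rng(0)
fake_gradient = rng.normal(size=256)
initial_stress = np.abs(fake_gradient) / np.abs(fake_gradient).max()

sizes, final_stress = ofc_avalanches(initial_stress)
print("events:", sizes.size, "largest avalanche:", sizes.max())
```

The resulting `sizes` array is the kind of cascade statistic whose distribution could then be inspected for heavy tails. How the paper maps such statistics to D(t) through finite-size scaling is not described in this summary and is not attempted here.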


