Why Grokking Takes So Long: A First-Principles Theory of Representational Phase Transitions
arXiv cs.AI / 3/17/2026
Key Points
- A first-principles theory explains grokking as a norm-driven representational phase transition during regularized training, where the model moves from high-norm memorization to a lower-norm generalized representation.
- The authors derive a scaling law for the grokking delay: T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2)), with gamma_eff depending on the optimizer (SGD or AdamW).
- They validate the theory with 293 training runs across modular addition, modular multiplication, and sparse parity tasks, confirming inverse scaling with weight decay and learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97).
- The results show that the optimizer must decouple memorization from contraction; SGD can fail to grok under hyperparameters where AdamW reliably groks.
- The work provides the first quantitative scaling law for grokking delay and frames grokking as a predictable consequence of norm separation between competing interpolating representations.
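The scaling law in the key points can be sketched numerically. The snippet below is a minimal illustration with made-up values (the `Theta` constant is taken as 1, and all norms and rates are hypothetical): it shows that the predicted delay scales inversely with the effective decay rate `gamma_eff` and only logarithmically with the norm ratio.

```python
import math

def grokking_delay(theta_mem_sq: float, theta_post_sq: float, gamma_eff: float) -> float:
    """Predicted grokking delay T_grok - T_mem per the paper's scaling law,
    with the Theta(.) constant assumed to be 1 for illustration."""
    return (1.0 / gamma_eff) * math.log(theta_mem_sq / theta_post_sq)

# Illustrative values only: halving the effective decay rate doubles the delay,
# while squaring the norm ratio merely doubles the log term.
d1 = grokking_delay(theta_mem_sq=100.0, theta_post_sq=25.0, gamma_eff=1e-3)
d2 = grokking_delay(theta_mem_sq=100.0, theta_post_sq=25.0, gamma_eff=5e-4)
d3 = grokking_delay(theta_mem_sq=400.0, theta_post_sq=25.0, gamma_eff=1e-3)
```

Under the theory, `gamma_eff` is optimizer-dependent (it differs between SGD and AdamW), which is what makes the same nominal weight decay produce very different grokking delays across optimizers.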