Why Grokking Takes So Long: A First-Principles Theory of Representational Phase Transitions
arXiv cs.AI / 3/17/2026
Key Points
- A first-principles theory explains grokking as a norm-driven representational phase transition during regularized training, where the model moves from high-norm memorization to a lower-norm generalized representation.
- The authors derive a scaling law for the grokking delay: T_grok − T_mem = Θ((1/γ_eff) · log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is an optimizer-dependent effective contraction rate that takes different forms for SGD and AdamW.
- They validate the theory with 293 training runs across modular addition, modular multiplication, and sparse parity tasks, confirming inverse scaling with weight decay and learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97).
- The results show that the optimizer must decouple memorization from contraction; SGD can fail to grok under hyperparameters where AdamW reliably groks.
- The work provides the first quantitative scaling law for grokking delay and frames grokking as a predictable consequence of norm separation between competing interpolating representations.
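The scaling law in the key points can be sketched numerically. In the snippet below, the function name and the approximation γ_eff ≈ learning rate × weight decay are illustrative assumptions (the paper derives optimizer-specific forms of γ_eff); the point is only to show the inverse dependence on weight decay and learning rate and the logarithmic dependence on the norm ratio:

```python
import math

def predicted_grokking_delay(lr: float, weight_decay: float, norm_ratio: float) -> float:
    """Predicted delay T_grok - T_mem, up to the hidden Theta-constant.

    Assumption (not from the paper's exact derivation): gamma_eff ~ lr * weight_decay,
    a common first-order model of decoupled-weight-decay norm contraction.
    norm_ratio is ||theta_mem||^2 / ||theta_post||^2, the squared-norm ratio of the
    memorizing solution to the post-grokking (generalized) solution.
    """
    gamma_eff = lr * weight_decay
    return (1.0 / gamma_eff) * math.log(norm_ratio)

# Inverse scaling: doubling weight decay (or learning rate) halves the delay.
base = predicted_grokking_delay(lr=1e-3, weight_decay=1.0, norm_ratio=10.0)
half = predicted_grokking_delay(lr=1e-3, weight_decay=2.0, norm_ratio=10.0)

# Logarithmic scaling: squaring the norm ratio only doubles the delay.
log_scaled = predicted_grokking_delay(lr=1e-3, weight_decay=1.0, norm_ratio=100.0)
```

Under this toy parameterization, a model whose memorizing solution has only a slightly larger norm than its generalized one (norm_ratio near 1) is predicted to grok almost immediately, while weak regularization (small lr × weight_decay) stretches the delay dramatically.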