[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo
Reddit r/MachineLearning / 3/18/2026

Zero failures across 300 seeds. 66× speedup. 5 lines of code. We're two independent researchers. The method: per-row ℓ₂ clipping on decoder weights after every optimizer step. No additional memory, no weight decay needed. Results are on the standard grokking benchmark (modular arithmetic, decoder-only transformer, same setup as Grokfast [2024]).

Honest scope: all experiments are modular arithmetic. We're running a 277M-parameter LLM test, but it will take weeks on our hardware and the results may not transfer cleanly; we're not claiming otherwise. Happy to share progress, the dataset, and full model/training parameters. Code + PDF are in the repo. We're seeking arXiv endorsement (cs.LG); DM if willing.
Key Points
- The method clips decoder weight vectors per row using L2 norm after every optimizer step, with no extra memory and no weight decay.
- On grokking-style benchmarks, a 2-layer model (422k params) with Lion+Clip reaches a 66× speedup over the AdamW baseline; an 8-layer model (1.6M params) reaches 18×, with zero failures across 300 seeds and a reduced interquartile range under edge initialization.
- The authors note that all experiments are limited to modular arithmetic on decoder-only transformers. A 277M-parameter LLM test is in progress, and they do not yet claim the results transfer.
- Code and PDF are available on GitHub (cliptogrok), and they are seeking arXiv endorsement (cs.LG).
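The post describes the method only in words (per-row ℓ₂ clipping on decoder weights after every optimizer step) and does not reproduce the code here. A minimal NumPy sketch of the clipping operation, assuming a single `max_norm` hyperparameter and row-major weight matrices (both are assumptions for illustration, not the authors' exact settings):

```python
import numpy as np

def clip_rows(weight: np.ndarray, max_norm: float = 1.0) -> np.ndarray:
    """Rescale each row of `weight` so its L2 norm is at most `max_norm`.

    Rows already under the cap are left unchanged. In training this would
    be applied (in place) to the decoder weights after optimizer.step().
    """
    row_norms = np.linalg.norm(weight, axis=1, keepdims=True)  # shape (rows, 1)
    # Shrink only rows whose norm exceeds the cap; epsilon guards zero rows.
    scale = np.minimum(1.0, max_norm / np.maximum(row_norms, 1e-12))
    return weight * scale
```

The same arithmetic translates directly to an in-place PyTorch version (`weight.mul_(scale)` under `torch.no_grad()`), which is plausibly how it fits in the "5 lines of code" the post claims.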