[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo
Reddit r/MachineLearning / 3/18/2026
📰 News · Models & Research
Zero failures across 300 seeds. 66× speedup. 5 lines of code. We're two independent researchers. The method: per-row ℓ₂ clipping on decoder weights after every optimizer step. No additional memory, no weight decay needed. Results are from the standard grokking benchmark: modular arithmetic, decoder-only transformer, same setup as Grokfast [2024].
Honest scope: all experiments are modular arithmetic. We're running a 277M LLM test, but it'll take weeks on our hardware and the results may not transfer cleanly — we're not claiming otherwise. Happy to share progress, the dataset, and full model/training parameters. Code + PDF: [link]. We're seeking arXiv endorsement (cs.LG) — DM if willing.
Key Points
- The method clips each row of the decoder weight matrices to a maximum L2 norm after every optimizer step, with no extra memory and no weight decay (a minimal sketch follows these points).
- On the grokking benchmark, a 2-layer model (422k parameters) reaches a 66× speedup over the AdamW baseline using Lion plus clipping, and an 8-layer model (1.6M parameters) reaches 18×, with zero failures across 300 seeds and a reduced interquartile range under edge initialization.
- The authors note that experiments are limited to modular arithmetic on decoder-only transformers; a 277M-parameter LLM test is in progress, and they make no claim yet that the results will transfer.
- Code and PDF are available on GitHub (cliptogrok), and they are seeking arXiv endorsement (cs.LG).
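The post's five lines of code are not reproduced in this digest, so the following is a minimal sketch, assuming PyTorch, of what per-row ℓ₂ clipping after an optimizer step typically looks like. The function name `clip_rows_`, the `max_norm` value, and the `decoder_weights()` accessor are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming PyTorch. clip_rows_, max_norm=1.0, and
# decoder_weights() are illustrative stand-ins, not the authors' code.
import torch

@torch.no_grad()
def clip_rows_(weight: torch.Tensor, max_norm: float = 1.0) -> None:
    """Rescale each row of `weight` in place so its L2 norm is at most `max_norm`."""
    row_norms = weight.norm(dim=1, keepdim=True)   # shape: (rows, 1)
    # scale is 1 for rows already under the cap, max_norm / row_norm otherwise
    scale = max_norm / row_norms.clamp(min=max_norm)
    weight.mul_(scale)

# Hypothetical training step: clip decoder rows after every optimizer step.
# model.decoder_weights() stands in for however the decoder (output-projection)
# matrices are collected; the post does not specify this.
def train_step(model, optimizer, batch, max_norm=1.0):
    loss = model(batch)          # assumes the model returns a scalar loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    for w in model.decoder_weights():
        clip_rows_(w, max_norm=max_norm)
    return loss.detach()
```

Because the rescaling is computed on the fly and applied in place, it adds no persistent state, consistent with the post's "no additional memory" claim; unlike weight decay, it leaves rows already under the cap untouched rather than shrinking every weight on every step.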
Related Articles

Jeff Bezos reportedly wants $100 billion to buy and transform old manufacturing firms with AI
TechCrunch
[R] Weekly digest: arXiv AI security papers translated for practitioners -- Cascade (cross-stack CVE+Rowhammer attacks on compound AI), LAMLAD (dual-LLM adversarial ML, 97% evasion), OpenClaw (4 vuln classes in agent frameworks)
Reddit r/MachineLearning
My Experience with Qwen 3.5 35B
Reddit r/LocalLLaMA

Cursor’s new coding model Composer 2 is here: It beats Claude Opus 4.6 but still trails GPT-5.4
VentureBeat
Qwen 3.5 122B completely falls apart at ~ 100K context
Reddit r/LocalLLaMA