[P] Clip to Grok Update: Weight Norm Clipping now 39–249× | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task

Reddit r/MachineLearning / 4/2/2026


Key Points

  • Researchers report an update to their “Clip to Grok” approach, extending weight norm clipping from modular multiplication to six broader algebraic tasks including mixed modular operations and an S5 permutation composition task.
  • The method remains per-row ℓ2 clipping on decoder weights after every optimizer step, with no weight decay or additional memory overhead, implemented via the provided norms.py code.
  • Across tasks, clipping dramatically reduces the median steps needed to reach 95% validation accuracy, with reported speedups ranging from roughly 39× to 249× compared with an AdamW baseline.
  • They identify task-specific optimal max_norm values (e.g., 2.0 for mul mod 97, 1.5–1.75 for several other tasks, and 1.0 for S5 permutation), and they include max_norm ablation/measurement details per task.
  • The expanded benchmark includes four single-operation mod 97 tasks, one all-mod mixed single dataset, and a non-abelian S5 permutation setup with 120 elements.

Seed 0 results on mul mod 97, the mixed add/sub/mul/div mod 97 task, and S5 permutation, with max_norm ablation

Update to our previous post. We're two independent researchers.

Since the last post we expanded from modular multiplication to six algebraic tasks:

  • Four modular arithmetic operations (addition, subtraction, multiplication, division mod 97)
  • A mixed task combining all four operations (addition, subtraction, multiplication, division) in a single all-mod dataset
  • S5 permutation composition (non-abelian, 120 elements).
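To make the S5 setup concrete, here is a hypothetical reconstruction of the task: S5 has 5! = 120 elements, and each example maps a pair of permutations to their composition. The tokenization and exact dataset construction are our assumptions, not the authors' code.

```python
from itertools import permutations

# All 120 elements of S5 as tuples, with a token id per element.
# Hypothetical reconstruction of the task, not the authors' exact pipeline.
elems = list(permutations(range(5)))
idx = {p: i for i, p in enumerate(elems)}

def compose(a, b):
    # Composition (a ∘ b)(x) = a(b(x))
    return tuple(a[b[x]] for x in range(5))

# Each example: input token pair (i, j), target token idx[elems[i] ∘ elems[j]]
pairs = [(i, j, idx[compose(elems[i], elems[j])])
         for i in range(len(elems)) for j in range(len(elems))]
print(len(elems), len(pairs))  # 120 elements, 14400 ordered pairs
```

Because S5 is non-abelian, compose(a, b) and compose(b, a) generally differ, which is what distinguishes this task from the commutative mod-97 operations.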

Method (unchanged): per-row ℓ₂ clipping on decoder weights after every optimizer step. No weight decay, no extra memory. Implementation: norms.py
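For readers who don't want to open the repo, the core operation can be sketched in a few lines. This is a minimal NumPy illustration of per-row ℓ₂ clipping, assuming a 2-D weight matrix with one row per output unit; the authors' actual implementation lives in norms.py and may differ in details.

```python
import numpy as np

def clip_rows_(W, max_norm):
    """Per-row l2 clipping: any row whose l2 norm exceeds max_norm is
    rescaled back onto the max_norm sphere; shorter rows are untouched.
    Sketch only -- see the authors' norms.py for the real implementation."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    W *= scale  # in-place, no extra persistent memory
    return W

W = np.array([[3.0, 4.0],   # norm 5.0 -> clipped to norm 1.0
              [0.3, 0.4]])  # norm 0.5 -> unchanged
clip_rows_(W, max_norm=1.0)
print(W)  # [[0.6, 0.8], [0.3, 0.4]]
```

In training this would run after every optimizer step, e.g. `optimizer.step(); clip_rows_(decoder_weight, max_norm)`, with max_norm set per task as in the table below.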

Median steps to 95% val accuracy (Lion+Clip, n=100 seeds per value per task, optimal max_norm per task):

| Task | Median [95% CI] | AdamW baseline | Seed 0 speedup | max_norm |
|---|---|---|---|---|
| mul mod 97 | 550 [530–560] | 35,040 | 66× | 2.0 |
| add mod 97 | 570 [555–590] | 40,240 | 69× | 1.75 |
| sub mod 97 | 775 [740–870] | 57,670 | 87× | 1.5 |
| div mod 97 | 730 [700–790] | 71,160 | 39× | 1.75 |
| all-mod (mixed) | 3,090 [2,880–3,300] | 86,400 | 50× | 1.75 |
| S5 permutation | 1,348 [1,252–1,424] | 390,896 | 249× | 1.0 |

The S5 result surprised us. The baseline takes 390,896 steps. Lion+Clip median is 1,348. The non-abelian structure forced a tighter clipping radius — S5 is sharply optimal at max_norm=1.0 and degrades fast above 1.25, while modular multiplication is happy at 2.0.

The most interesting finding: max_norm correlates with algebraic complexity. Inverse-dependent operations (div, sub) favor 1.5–1.75. Direct operations (mul, add) tolerate up to 2.0. Mixed and non-abelian tasks pull tighter. The bottom-right panel shows this across all three task types, n=100 seeds per value.

Total experiments:

| | Adam | Lion | SignSGD | Total |
|---|---|---|---|---|
| Runs | 2,126 | 7,137 | 2,125 | 11,388 |
| Unique seeds | 821 | 2,521 | 822 | 4,164 |

(including baselines)

Honest scope: all experiments are algebraic tasks (modular arithmetic and permutation groups). Results may not transfer to other domains — we're not claiming otherwise.

Code + PDF:
https://github.com/NiftyliuS/cliptogrok
https://github.com/NiftyliuS/cliptogrok/blob/main/cliptogrok.pdf

An implementation is also available in fast-weight-attention by lucidrains.

We're still seeking arXiv endorsement (cs.LG) — DM if willing.

submitted by /u/niftylius