| Update to our previous post. We're two independent researchers. Since the last post we expanded from modular multiplication to six algebraic tasks.
Method (unchanged): per-row ℓ₂ clipping on decoder weights after every optimizer step. No weight decay, no extra memory. Implementation: norms.py.

Median steps to 95% val accuracy (Lion+Clip, n=100 seeds per value per task, optimal max_norm per task): (per-task table in the original post, not captured in this extract)
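The post links its implementation (norms.py) but doesn't inline it. The rule as described, per-row ℓ₂ clipping applied to the decoder weight matrix after each optimizer step, can be sketched as follows; the function name and the NumPy framing are mine, not the authors'.

```python
import numpy as np

def clip_rows(W: np.ndarray, max_norm: float) -> np.ndarray:
    """Per-row l2 clipping: any row whose l2 norm exceeds max_norm is
    rescaled back onto that radius; rows already inside are untouched.
    Stateless, so it adds no optimizer memory (illustrative sketch)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)          # shape (rows, 1)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale
```

In a PyTorch training loop the equivalent would run under `torch.no_grad()` on the decoder's weight immediately after `optimizer.step()`, with max_norm set per task (e.g. 1.0 for S5, 2.0 for mul mod 97, per the post's numbers).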
The S5 result surprised us. The baseline takes 390,896 steps; the Lion+Clip median is 1,348. The non-abelian structure forced a tighter clipping radius: S5 is sharply optimal at max_norm=1.0 and degrades fast above 1.25, while modular multiplication is happy at 2.0.

The most interesting finding: the optimal max_norm correlates with algebraic complexity. Inverse-dependent operations (div, sub) favor 1.5–1.75; direct operations (mul, add) tolerate up to 2.0; mixed and non-abelian tasks pull tighter. The bottom-right panel shows this across all three task types (n=100 seeds per value).

Total experiments: … including baselines.

Honest scope: all experiments are algebraic tasks (modular arithmetic and permutation groups). Results may not transfer to other domains, and we're not claiming otherwise.

Code + PDF:
An implementation is also available in fast-weight-attention by lucidrains. We're still seeking arXiv endorsement (cs.LG); DM if willing. |
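The post doesn't include its data-generation code. Under the usual grokking setup (classify the full operation table), the mod-97 operations and the S5 composition task it describes can be generated roughly like this; all names here are illustrative, not the authors'.

```python
from itertools import permutations

P = 97  # modular tasks: label the result of (a op b) for all pairs

def mod_task(op: str):
    """All labelled examples of one mod-97 operation (P*P pairs; div skips b=0)."""
    if op == "div":  # inverse-dependent: a / b = a * b^(P-2) mod P (Fermat, P prime)
        return [(a, b, a * pow(b, P - 2, P) % P) for a in range(P) for b in range(1, P)]
    f = {"add": lambda a, b: (a + b) % P,
         "sub": lambda a, b: (a - b) % P,
         "mul": lambda a, b: (a * b) % P}[op]
    return [(a, b, f(a, b)) for a in range(P) for b in range(P)]

# Non-abelian task: composition in S5 (all 5! = 120 permutations of 5 elements).
s5 = list(permutations(range(5)))
idx = {p: i for i, p in enumerate(s5)}

def compose(p, q):
    """(p o q)(i) = p[q[i]]; order matters, S5 is non-abelian."""
    return tuple(p[q[i]] for i in range(5))

s5_task = [(idx[a], idx[b], idx[compose(a, b)]) for a in s5 for b in s5]
```

The post's "all-mod mixed single dataset" would presumably concatenate the four mod-97 tasks with an operation token added to each example; that detail isn't specified.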
[P] Clip to Grok Update: Weight Norm Clipping Now Gives 39–249× Speedups | 6 Tasks (mod arithmetic, mixed ops, S5 permutation) | max_norm Measured Per Task
Reddit r/MachineLearning / 4/2/2026
Key Points
- Researchers report an update to their “Clip to Grok” approach, extending weight norm clipping from modular multiplication to six broader algebraic tasks including mixed modular operations and an S5 permutation composition task.
- The method remains per-row ℓ2 clipping on decoder weights after every optimizer step, with no weight decay or additional memory overhead, implemented via the provided norms.py code.
- Across tasks, clipping dramatically reduces the median steps needed to reach 95% validation accuracy, with reported speedups ranging from roughly 39× to 87× compared with an AdamW baseline.
- They identify task-specific optimal max_norm values (e.g., 2.0 for mul mod 97 down to 1.5–1.75 for several other tasks), and they include max_norm ablation/measurement details per task.
- The expanded benchmark includes four single-operation mod 97 tasks, one all-mod mixed single dataset, and a non-abelian S5 permutation setup with 120 elements.
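One ordering detail worth pinning down: as described, this is a projection applied after the optimizer update, not a gradient clip. A minimal stdlib sketch of one training step, with plain SGD standing in for Lion (the post's optimizer) and hypothetical names:

```python
import math

def clip_row(row, max_norm):
    """Project one weight row back inside the l2 ball of radius max_norm."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm <= max_norm or norm == 0.0:
        return list(row)
    return [x * (max_norm / norm) for x in row]

def train_step(W, grads, lr=0.1, max_norm=1.0):
    """Optimizer step first (plain SGD here; the post uses Lion), then per-row clip."""
    W = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grads)]
    return [clip_row(r, max_norm) for r in W]
```

Because the clip runs every step, the decoder rows can never leave the max_norm ball, which is why the method needs no extra state or weight decay.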