Learning Rate Engineering: From Coarse Single Parameter to Layered Evolution
arXiv cs.AI / 5/1/2026
Key Points
- The paper traces learning-rate scheduling’s evolution across five generations, from global fixed SGD rates to joint layer-and-time scheduling that adapts updates by depth and training phase.
- It motivates finer-grained scheduling with the “impossible trinity” of transfer learning: a single global rate cannot simultaneously give lower layers the small updates needed to retain general features and give higher layers the larger updates required to learn the new task.
- The authors introduce Discriminative Adaptive Layer Scaling (DALS), which combines phase-adaptive cosine scheduling, depth-aware Grokfast-style gradient filtering, and LARS-like trust ratios into a single optimizer framework (a minimal layer-wise scheduling sketch follows this list).
- Benchmarks across 18 learning-rate/optimizer strategies (including DALS variants) on synthetic data, CIFAR-10 (from scratch), RTE, TREC-6, and IMDb (fine-tuning) show DALS delivers the best synthetic accuracy (98.0%), while DALS-Fast reaches 90% in 3 epochs.
- Cross-dataset results reveal regime-dependent winners and highlight that some directional-decay methods (e.g., STLR+Discriminative/ULMFiT) can catastrophically fail on from-scratch tasks without pretrained representations.
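To make the layer-and-time idea concrete, here is a minimal PyTorch sketch, not the paper's code, that combines discriminative per-layer learning rates with a warmup-plus-cosine schedule. The function names, the layer-wise decay factor of 0.5, the warmup fraction, and the toy model are illustrative assumptions; DALS's depth-aware Grokfast-style gradient filtering and LARS-like trust ratios are not reproduced here.

```python
import math

import torch
import torch.nn as nn

def discriminative_param_groups(model: nn.Module, base_lr: float = 1e-3, decay: float = 0.5):
    """Give lower layers smaller learning rates than the head.

    Generic discriminative-LR sketch: the top parameter-bearing child of `model`
    gets `base_lr`, and each one below it gets the previous rate times `decay`.
    """
    groups = []
    depth_from_top = 0
    for child in reversed(list(model.children())):
        params = [p for p in child.parameters() if p.requires_grad]
        if not params:  # skip parameter-free modules such as activations
            continue
        groups.append({"params": params, "lr": base_lr * decay ** depth_from_top})
        depth_from_top += 1
    return groups

def phase_adaptive_cosine(step: int, total_steps: int, warmup_frac: float = 0.1) -> float:
    """Multiplicative schedule: linear warmup phase, then cosine-decay phase."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Toy model standing in for a pretrained backbone plus a task head.
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),   # "lower" layers -> smallest rate
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),              # "head" -> full base rate
)
optimizer = torch.optim.SGD(discriminative_param_groups(model, base_lr=1e-2), lr=1e-2)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: phase_adaptive_cosine(step, total_steps=1_000)
)

# Minimal training loop on random data to show how the pieces fit together.
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
for _ in range(5):
    optimizer.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    optimizer.step()
    scheduler.step()
```

In a real fine-tuning setup the parameter groups would follow the backbone's actual block structure, and the decay factor and warmup fraction would be tuned per task rather than fixed as above.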