Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
arXiv cs.CL / 3/18/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study investigates the role of learning rate scheduling during pre-training of large language models, introducing Warmup-Stable-Only (WSO), which holds the learning rate constant after warmup with no decay phase.
- Experiments on 1B and 8B parameter models show that WSO yields better downstream performance after supervised fine-tuning (SFT) than decay-based schedulers, even though the decay-based schedulers perform better during pre-training itself.
- The results hold across different training regimes, including mid-training and over-training, and are supported by loss landscape analysis showing decay schedulers drive sharper minima while WSO preserves flatter minima.
- The findings offer practical guidance for training and release strategies, suggesting pre-training with WSO enhances downstream adaptability of models.
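The schedule contrast described in the key points can be made concrete with a small sketch. The code below is a hypothetical illustration, not the paper's implementation: a WSO-style schedule (linear warmup, then a constant learning rate) next to a typical warmup-plus-cosine-decay baseline for comparison. All names and hyperparameter values (`warmup_steps`, `peak_lr`, `min_lr`, `total_steps`) are illustrative assumptions.

```python
import math

def wso_lr(step, warmup_steps=1000, peak_lr=3e-4):
    """Warmup-Stable-Only: linear warmup, then hold the peak rate with no decay."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # stable phase: learning rate never decays

def cosine_lr(step, warmup_steps=1000, total_steps=100_000,
              peak_lr=3e-4, min_lr=3e-5):
    """A common decay-based baseline: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Under this sketch, the two schedules are identical through warmup and diverge afterward: the decay baseline anneals toward `min_lr` by the end of training, while WSO stays at `peak_lr`, which is the property the paper associates with flatter minima and better post-SFT performance.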