Pre-training LLMs without Learning Rate Decay Enhances Supervised Fine-Tuning
arXiv cs.CL / 3/18/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study investigates the role of learning rate scheduling during pre-training of large language models, introducing Warmup-Stable-Only (WSO), a scheduler that holds the learning rate constant after warmup with no decay phase (a minimal schedule sketch follows this list).
- Experiments on 1B- and 8B-parameter models show that WSO yields better downstream performance after supervised fine-tuning (SFT) than decay-based schedulers, even though the decay-based schedulers perform better during pre-training itself.
- The results hold across different training regimes, including mid-training and over-training, and are supported by loss-landscape analysis showing that decay schedulers drive models toward sharper minima while WSO preserves flatter ones (see the sharpness sketch below).
- The findings offer practical guidance for training and release strategies, suggesting that pre-training with WSO improves the downstream adaptability of released models.
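To make the scheduling difference concrete, here is a minimal sketch of a WSO schedule next to a standard warmup-plus-cosine-decay baseline. The linear warmup shape, peak learning rate, and step counts are illustrative assumptions, not the paper's exact configuration.

```python
import math

def wso_lr(step, peak_lr, warmup_steps):
    """Warmup-Stable-Only: linear warmup, then hold the peak LR with no decay.
    The linear warmup shape is an assumption for illustration."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # stable phase: the learning rate never decays

def cosine_decay_lr(step, max_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Common warmup + cosine-decay baseline, shown for comparison."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Compare the two schedules at a few points of a hypothetical 100k-step run.
for s in (0, 1_999, 50_000, 99_999):
    print(s,
          f"WSO={wso_lr(s, 3e-4, 2_000):.2e}",
          f"cosine={cosine_decay_lr(s, 100_000, 3e-4, 2_000):.2e}")
```

The schedules coincide through warmup and then diverge: WSO stays at the peak learning rate for the rest of the run, which is the property the paper links to flatter minima and better post-SFT performance.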
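The flat-versus-sharp comparison can be probed with a simple perturbation test: average how much the loss rises when the weights are nudged by small random noise. This is only a crude proxy to illustrate the idea; the paper's loss-landscape analysis may use a different measure, and the model, data, and epsilon below are toy placeholders.

```python
import torch
import torch.nn as nn

def sharpness_proxy(model, loss_fn, epsilon=1e-3, n_samples=8):
    """Average loss increase under small random weight perturbations.
    Larger values suggest a sharper minimum; this is a rough proxy only."""
    params = [p for p in model.parameters() if p.requires_grad]
    with torch.no_grad():
        base_loss = loss_fn(model).item()
        increases = []
        for _ in range(n_samples):
            noise = [torch.randn_like(p) * epsilon for p in params]
            for p, n in zip(params, noise):
                p.add_(n)
            increases.append(loss_fn(model).item() - base_loss)
            for p, n in zip(params, noise):
                p.sub_(n)  # restore the original weights
    return sum(increases) / n_samples

# Toy usage on a linear regression model with random data.
torch.manual_seed(0)
x, y = torch.randn(256, 16), torch.randn(256, 1)
model = nn.Linear(16, 1)
print(sharpness_proxy(model, lambda m: nn.functional.mse_loss(m(x), y)))
```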
Related Articles

The programming passion is melting
Dev.to

Maximize Developer Revenue with Monetzly's Innovative API for AI Conversations
Dev.to

Co-Activation Pattern Detection for Prompt Injection: A Mechanistic Interpretability Approach Using Sparse Autoencoders
Reddit r/LocalLLaMA

How to Train Custom Language Models: Fine-Tuning vs Training From Scratch (2026)
Dev.to

KoboldCpp 1.110 - 3 YR Anniversary Edition, native music gen, qwen3tts voice cloning and more
Reddit r/LocalLLaMA