Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
arXiv cs.LG / 4/13/2026
Key Points
- The paper studies a geometric aspect of LLM pretraining, asking whether models converge to a common minimizer across data sources or simply a minimizer of total summed loss, and links this to downstream generalization.
- It finds that common optimizers like AdamW frequently lead to task-specific minima that are far apart, which may harm out-of-distribution performance.
- The authors propose the Nexus optimizer, which increases gradient similarity across data sources during training to encourage "closer" task-specific minima while still reaching the same final pretraining loss (a minimal sketch of this idea appears after this list).
- Experiments across 130M–3B parameter models and multiple data mixtures/hyperparameter schedules show Nexus delivers significant downstream gains, including reported improvements on GSM8k and reduced out-of-distribution loss for the 3B model.
- The work argues that pretraining loss alone is an insufficient proxy for evaluation, highlighting the role of implicit optimization biases in achieving better generalization.
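The summary describes Nexus only at a high level, so the sketch below is merely an illustration of the gradient-similarity idea: it adds a penalty for misaligned gradients between two data sources on top of the ordinary summed pretraining loss. The two-source setup, the `lam` weight, and the helper names (`gradient_similarity_step`, `loss_fn`) are assumptions for illustration, not the paper's actual Nexus update rule.

```python
import torch
import torch.nn.functional as F

def gradient_similarity_step(model, optimizer, batch_a, batch_b, loss_fn, lam=0.1):
    """One illustrative training step that encourages gradient alignment
    between two data sources. Hypothetical sketch, not the paper's method."""
    # Per-source losses on the shared parameters
    # (assumes every parameter participates in both losses).
    loss_a = loss_fn(model, batch_a)
    loss_b = loss_fn(model, batch_b)

    params = [p for p in model.parameters() if p.requires_grad]

    # Per-source gradients, kept in the graph so the similarity term is differentiable.
    grads_a = torch.autograd.grad(loss_a, params, create_graph=True)
    grads_b = torch.autograd.grad(loss_b, params, create_graph=True)

    flat_a = torch.cat([g.reshape(-1) for g in grads_a])
    flat_b = torch.cat([g.reshape(-1) for g in grads_b])

    # Cosine similarity between the two source gradients; 1.0 means fully aligned.
    cos = F.cosine_similarity(flat_a, flat_b, dim=0)

    # Total objective: summed pretraining loss plus a penalty for misaligned gradients.
    total = loss_a + loss_b + lam * (1.0 - cos)

    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return loss_a.item(), loss_b.item(), cos.item()
```

In this sketch an ordinary optimizer such as AdamW could be passed as `optimizer`; the point is only that a gradient-alignment term is optimized jointly with the usual summed loss, so the model can reach the same pretraining loss while the per-source minima stay closer together.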