daVinci-LLM: Towards the Science of Pretraining
arXiv cs.AI / 3/31/2026
Key Points
- The paper argues that foundational pretraining largely determines a language model’s ultimate capability ceiling, and that this phase is still insufficiently studied compared with post-training.
- It introduces daVinci-LLM as an effort to combine industrial-scale compute with full academic research freedom, via a fully open release of data-processing pipelines, training recipes, and exploration results.
- The authors use the Data Darwinism framework, an L0–L9 taxonomy spanning filtering through synthesis, to systematically structure and study how data-processing choices affect pretraining outcomes (a hypothetical sketch of such a ladder follows this list).
- They train a 3B-parameter model from scratch on 8T tokens with a two-stage adaptive curriculum, and run 200+ controlled ablations to quantify key drivers such as processing depth, domain-specific saturation dynamics, and compositional balance (see the curriculum sketch below).
- The study also highlights that evaluation-protocol design can change how progress is interpreted, and the authors aim to enable a cumulative "science of pretraining" through reproducible methodology (see the scoring sketch below).
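
The summary only names the endpoints of the L0–L9 taxonomy (filtering at the low end, synthesis at the high end), so the following is a minimal Python sketch of what such a processing ladder could look like. The intermediate deduplication stage, all function names, and the depth-cutoff mechanism are illustrative assumptions, not the paper's actual level definitions.

```python
# Hypothetical sketch of an L0-L9 processing ladder in the spirit of the
# Data Darwinism taxonomy. Only "filtering" (low end) and "synthesis"
# (high end) are stated in the summary; everything else is assumed.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProcessingLevel:
    level: int                                     # position on the L0-L9 ladder
    name: str                                      # illustrative stage label
    transform: Callable[[list[str]], list[str]]    # corpus-level transform

def basic_filter(docs: list[str]) -> list[str]:
    """L0-style filtering: drop near-empty documents."""
    return [d for d in docs if len(d.split()) >= 20]

def dedup(docs: list[str]) -> list[str]:
    """Exact dedup as a stand-in for an assumed mid-ladder stage."""
    return list(dict.fromkeys(docs))

def synthesize(docs: list[str]) -> list[str]:
    """L9-style synthesis placeholder: trivially augments each document."""
    return docs + [f"Summary: {d[:80]}" for d in docs]

PIPELINE = [
    ProcessingLevel(0, "filtering", basic_filter),
    ProcessingLevel(4, "deduplication", dedup),    # assumed intermediate stage
    ProcessingLevel(9, "synthesis", synthesize),
]

def run_to_depth(docs: list[str], max_level: int) -> list[str]:
    """Apply stages up to a chosen processing depth, so ablations can
    compare shallow vs. deep processing of the same raw corpus."""
    for stage in PIPELINE:
        if stage.level <= max_level:
            docs = stage.transform(docs)
    return docs
```

Framing the ladder this way makes "processing depth" a single ablation knob: `run_to_depth(raw, 0)` versus `run_to_depth(raw, 9)` trains on the same source corpus at different depths of processing.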
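The paper's actual adaptation rule is not described in this summary. Below is a hedged sketch of one plausible form a two-stage adaptive curriculum could take: stage-two domain weights shift away from domains whose validation loss has saturated. The `reweight` function, the mixture proportions, and the loss deltas are all illustrative assumptions.

```python
# Minimal sketch of a two-stage curriculum with adaptive domain weights.
# The reweighting rule (down-weight domains whose validation loss has
# stopped improving) is an assumption for illustration only.

def reweight(weights: dict[str, float],
             loss_delta: dict[str, float],
             floor: float = 0.02) -> dict[str, float]:
    """Shift sampling mass toward domains that are still improving.

    loss_delta[d] is the recent drop in validation loss for domain d;
    a near-zero delta is treated as saturation via the floor."""
    raw = {d: max(loss_delta[d], floor) * w for d, w in weights.items()}
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}

# Stage 1: broad web-heavy mixture (illustrative numbers, not the paper's).
stage1 = {"web": 0.6, "code": 0.2, "math": 0.1, "books": 0.1}

# Hypothetical per-domain improvement since the last checkpoint.
deltas = {"web": 0.001, "code": 0.04, "math": 0.05, "books": 0.01}

# Stage 2: web has saturated, so mass moves toward code and math.
stage2 = reweight(stage1, deltas)
print(stage2)  # web ~0.44, code ~0.30, math ~0.19, books ~0.07
```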
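To make the evaluation-protocol point concrete, here is a toy illustration (not from the paper) in which the same fabricated model outputs are scored under a strict exact-match protocol and under a lenient normalized-match protocol; the measured accuracy differs purely because of the protocol choice.

```python
# Toy demonstration that protocol design changes measured progress:
# identical predictions, two scoring rules, two different accuracies.
# All inputs are fabricated for illustration.
import re

def strict_match(pred: str, gold: str) -> bool:
    """Exact string equality."""
    return pred == gold

def lenient_match(pred: str, gold: str) -> bool:
    """Equality after lowercasing and stripping non-alphanumerics."""
    norm = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return norm(pred) == norm(gold)

preds = ["The answer is 42.", "Paris", "7"]
golds = ["42", "paris", "8"]

for scorer in (strict_match, lenient_match):
    acc = sum(scorer(p, g) for p, g in zip(preds, golds)) / len(golds)
    print(scorer.__name__, round(acc, 3))  # strict: 0.0, lenient: 0.333
```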


