InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
arXiv cs.CL / 5/5/2026
Key Points
- The paper argues that simply upweighting high-quality data in LLM pretraining can backfire in data-limited or overtraining regimes by increasing repetition and hurting performance.
- It introduces InfoLaw, a data-aware information scaling framework that models training loss using consumed tokens, model size, mixture weights, and repetition rather than relying on standard scaling laws.
- The approach treats pretraining as information accumulation, where data quality affects information density and repetition creates scale-dependent diminishing returns.
- Experiments using varied dataset scales, quality distributions, and repetition levels show InfoLaw can predict loss on unseen data mixtures and scale-up runs (up to 7B parameters and 425B tokens) with low error and robust extrapolation across overtraining.
- By modeling how loss responds to data-mixing and repetition choices under different compute budgets, InfoLaw aims to make selecting optimal data recipes more efficient and less underdetermined during scaling.
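The idea sketched in the points above can be made concrete with a toy loss model. The snippet below is an illustrative sketch only, not the paper's actual functional form: it combines a Chinchilla-style parametric loss with an exponential-decay term for repeated epochs (a form used in prior data-constrained scaling work) and quality weights that scale each domain's effective information. All constants (`e`, `a`, `b`, `alpha`, `beta`, `rstar`) are placeholder values, not fitted coefficients from the paper.

```python
import math


def effective_tokens(unique, epochs, rstar=15.0):
    # Repeated passes over the same unique tokens add exponentially
    # diminishing information; one epoch returns `unique` unchanged.
    # (Illustrative exponential-decay form; the paper's exact
    # parameterization may differ.)
    return unique * (1.0 + rstar * (1.0 - math.exp(-(epochs - 1.0) / rstar)))


def predicted_loss(n_params, mixture,
                   e=1.7, a=406.4, b=410.7, alpha=0.34, beta=0.28):
    # mixture: list of (unique_tokens, epochs, quality_weight) per domain.
    # Quality weight scales each domain's information density, so
    # high-quality domains contribute more effective data per token.
    d_eff = sum(q * effective_tokens(u, r) for u, r, q in mixture)
    return e + a / n_params ** alpha + b / d_eff ** beta


# Two recipes with the same 200B consumed tokens: fresh data vs. two
# epochs over half as much unique data. The repeated recipe yields
# fewer effective tokens, hence higher predicted loss.
mix_fresh = [(2e11, 1.0, 1.0)]  # 200B unique tokens, one epoch
mix_rep = [(1e11, 2.0, 1.0)]    # 100B unique tokens, two epochs
print(predicted_loss(7e9, mix_fresh) < predicted_loss(7e9, mix_rep))
```

Under a fixed token budget, a fit like this lets one compare candidate mixture weights and repetition counts analytically instead of training each recipe, which is the efficiency gain the paper targets.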