InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

arXiv cs.CL / 5/5/2026


Key Points

  • The paper argues that simply upweighting high-quality data in LLM pretraining can backfire in data-limited or overtraining regimes by increasing repetition and hurting performance.
  • It introduces InfoLaw, a data-aware information scaling framework that models training loss using consumed tokens, model size, mixture weights, and repetition rather than relying on standard scaling laws.
  • The approach treats pretraining as information accumulation, where data quality affects information density and repetition creates scale-dependent diminishing returns.
  • Experiments using varied dataset scales, quality distributions, and repetition levels show InfoLaw can predict loss on unseen data mixtures and scale-up runs (up to 7B parameters and 425B tokens) with low error and robust extrapolation across overtraining levels.
  • By predicting how loss changes with data-mixing and repetition choices under different compute budgets, InfoLaw aims to make selecting optimal data recipes more efficient and less underdetermined during scaling; a minimal sketch of one possible loss-law form follows this list.
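
The summary does not give InfoLaw's actual functional form, so the sketch below is only an illustration of the kind of data-aware loss law the key points describe: a Chinchilla-style term in model size combined with a quality-weighted, repetition-discounted "effective information" term. The function names (`effective_information`, `predicted_loss`), the saturating-exponential repetition discount, and all coefficient values are assumptions made for this example, not the paper's model.

```python
import numpy as np

def effective_information(unique_tokens, weights, quality, repeats, r_star=5.0):
    """Illustrative quality-weighted, repetition-discounted information measure.

    unique_tokens[d]: unique tokens available in domain d
    weights[d]:       mixture weight (fraction of training drawn from d)
    quality[d]:       assumed relative information density of domain d
    repeats[d]:       number of passes over domain d's unique tokens
    r_star:           assumed saturation constant for repeated data

    Each extra pass contributes less new information, modelled here with a
    saturating exponential -- an assumption, not the paper's form.
    """
    u, w, q, r = map(np.asarray, (unique_tokens, weights, quality, repeats))
    # Diminishing value of repeated passes: r_eff grows from 1 toward 1 + r_star.
    r_eff = 1.0 + r_star * (1.0 - np.exp(-(r - 1.0) / r_star))
    return float(np.sum(w * q * u * r_eff))

def predicted_loss(n_params, info, E=1.7, A=400.0, alpha=0.34, B=1200.0, beta=0.28):
    """Chinchilla-style loss law with effective information replacing raw tokens.

    All coefficients are placeholders that would be fit from small-scale runs.
    """
    return E + A / n_params**alpha + B / info**beta

# Example: three domains with different quality and repetition levels.
info = effective_information(
    unique_tokens=[2e11, 5e10, 1e10],
    weights=[0.6, 0.3, 0.1],
    quality=[1.0, 1.5, 2.0],
    repeats=[1, 3, 8],
)
print(predicted_loss(n_params=1e9, info=info))
```

Under a form like this, stronger upweighting of a small high-quality domain raises its quality factor but also its repetition count, and the saturating discount is what lets the law capture the "upweighting can backfire" behaviour described above.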

Abstract

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, making the selection of optimal data recipes at scale underdetermined. To address this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. We then build a model of information accumulation that accurately predicts this measured performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B parameters and 425B tokens) with 0.15% mean and 0.96% maximum absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
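
The abstract's two-step procedure (collect performance across small runs, then fit an information model that predicts it) can be illustrated with the same assumed functional form as the sketch above. The run table, coefficient names (E, A, alpha, B, beta), and initial guesses below are placeholders, not values from the paper; only the overall workflow of fitting on small-scale runs and extrapolating to a larger one follows the described approach.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical measurements from small-scale runs: each row is
# (model parameters, effective information, observed validation loss).
# The numbers are made-up placeholders purely to show the fitting step.
runs = np.array([
    [1.6e8, 5.0e9,  3.20],
    [1.6e8, 2.0e10, 3.05],
    [4.0e8, 2.0e10, 2.90],
    [4.0e8, 8.0e10, 2.78],
    [1.0e9, 8.0e10, 2.66],
    [1.0e9, 3.0e11, 2.55],
])
N, I, L_obs = runs.T

def residuals(theta):
    """Residuals of the assumed loss law L = E + A/N^alpha + B/I^beta."""
    E, logA, alpha, logB, beta = theta
    pred = E + np.exp(logA) / N**alpha + np.exp(logB) / I**beta
    return pred - L_obs

# Fit the scale coefficients in log-space so they stay positive.
fit = least_squares(residuals, x0=[1.7, np.log(400.0), 0.3, np.log(1200.0), 0.3])
E, logA, alpha, logB, beta = fit.x

# Extrapolate to a larger hypothetical run, e.g. ~7B parameters with more data.
loss_7b = E + np.exp(logA) / 7e9**alpha + np.exp(logB) / 5e11**beta
print(f"predicted loss at 7B scale: {loss_7b:.3f}")
```

Because the effective-information term already folds in mixture weights, quality, and repetition, comparing such predictions across candidate recipes at a fixed compute budget is what would make data-recipe selection tractable before committing to a large run.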
