Prescriptive Scaling Laws for Data-Constrained Training

arXiv cs.LG / 5/5/2026

Key Points

  • The paper addresses a regime where training compute is abundant but high-quality data is scarce, shifting the focus from compute allocation to maximizing value from limited data.
  • It critiques the widely used Chinchilla scaling law, which assumes all training tokens are unique and therefore becomes unreliable once data must be repeated.
  • The authors derive a new scaling law by modeling excess loss under token repetition as a simple additive overfitting penalty, which they find matches observed model behavior (see the sketch after this list).
  • The new law yields qualitatively different compute-optimal allocation advice: beyond a threshold, additional repetition hurts performance, and compute is better spent on increasing model capacity.
  • The one-parameter formulation isolates overfitting in a single coefficient, enabling direct comparison across training setups. As a case study, the authors show that strong weight decay (λ = 1.0) cuts this coefficient by about 70%, explaining why optimal weight decay in data-constrained settings is an order of magnitude larger than standard practice.
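
The additive form in the third and fifth points can be made concrete in a few lines. The sketch below is illustrative, not the paper's fitted model: the constants are the published Chinchilla fit, and the log(R) penalty shape and the coefficient values (0.10 versus 0.03, a 70% reduction) are assumptions chosen only to show how a single overfitting coefficient enters the loss.

```python
import numpy as np

# Chinchilla-style parametric loss. The constants are the published
# Chinchilla fit and stand in for whatever values one would refit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Baseline loss, valid when all D training tokens are unique."""
    return E + A / N**alpha + B / D**beta

def repeated_data_loss(N, U, R, c_overfit):
    """HYPOTHETICAL one-parameter form: this summary does not give the
    paper's exact penalty, so the log(R) dependence is illustrative.
    c_overfit is the single coefficient that isolates overfitting."""
    D = U * R                          # total tokens seen over R epochs
    return chinchilla_loss(N, D) + c_overfit * np.log(R)

# A 70% drop in the coefficient (as reported for weight decay λ = 1.0)
# translates directly into a smaller repetition penalty:
N, U, R = 1e9, 1e10, 16                # model size, unique tokens, epochs
for c in (0.10, 0.03):                 # baseline vs. strong weight decay
    print(f"c_overfit={c}: loss = {repeated_data_loss(N, U, R, c):.3f}")
```

Because the penalty is additive and carries one free coefficient, two training setups can be compared by fitting that coefficient alone, which is what makes the weight-decay comparison in the last point possible.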

Abstract

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique, which limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice: beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ = 1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
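
To see why the compute-optimal advice changes qualitatively, consider a toy sweep over model size N under a fixed FLOPs budget. Everything beyond the Chinchilla constants and the standard C ≈ 6ND FLOPs approximation is an assumption: the budget C, the unique-token supply U, and the penalty shape and coefficient are chosen for illustration, not taken from the paper.

```python
import numpy as np

# Toy compute-optimal allocation sweep under a fixed FLOPs budget.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28  # Chinchilla fit
C = 1e23   # total training FLOPs (assumed budget)
U = 5e10   # unique tokens available (assumed supply)

def loss(N, c_overfit):
    D = C / (6 * N)                    # tokens trainable under C ~ 6*N*D
    R = max(D / U, 1.0)                # epochs over the unique data
    base = E + A / N**alpha + B / D**beta
    return base + c_overfit * np.log(R)  # illustrative additive penalty

Ns = np.logspace(8, 11, 400)           # candidate model sizes
for c in (0.0, 0.05):                  # unique-data regime vs. overfitting
    best = Ns[np.argmin([loss(N, c) for N in Ns])]
    print(f"c_overfit={c}: optimal N ~ {best:.2e}, "
          f"epochs ~ {C / (6 * best * U):.1f}")
```

With the penalty switched off (c_overfit = 0), the sweep recovers a Chinchilla-style optimum that repeats the data many times; switching it on moves the optimum to a much larger model trained for far fewer epochs, matching the paper's advice that past a threshold compute is better spent on capacity than on repetition.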