Scaling Laws are Redundancy Laws

arXiv stat.ML / 3/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that deep learning scaling laws can be derived as redundancy laws, giving the scaling exponent a mathematical origin instead of leaving it unexplained.
  • Using kernel regression, it links the excess-risk power-law exponent alpha = 2s / (2s + 1/beta) to the tail of the data covariance spectrum, with 1/beta acting as a redundancy measure that sets the learning-curve slope (a heuristic balance sketch follows this list).
  • The authors find that the slope of learning curves is not universal and varies with data redundancy, with steeper covariance spectra leading to faster returns to scale.
  • They claim broad universality of the resulting law across boundedly invertible transformations, multimodal mixture data, finite-width approximations, and Transformer models in both NTK/linearized and feature-learning regimes.
  • The work positions itself as the first rigorous finite-sample mathematical explanation that unifies empirical scaling-law observations with theoretical foundations grounded in data redundancy.
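
A heuristic way to see where the stated exponent could come from (a back-of-the-envelope sketch under assumed conventions, not the paper's actual argument): if the covariance eigenvalues decay as lambda_k ~ k^(-beta) and the target satisfies a source condition of order s, a spectral-cutoff estimator with cutoff m pays roughly m/n in estimation error and m^(-2*s*beta) in approximation error, and balancing the two terms recovers the quoted exponent.

```latex
% Heuristic balance argument (assumed parameterization, not the paper's proof):
% covariance eigenvalues \lambda_k \propto k^{-\beta}, source condition of order s;
% m/n is the estimation term, m^{-2s\beta} the approximation term.
\[
  \mathcal{R}(m,n) \;\approx\; \frac{m}{n} + m^{-2s\beta}
  \quad\Longrightarrow\quad
  m^{\star} \propto n^{1/(2s\beta+1)},
  \qquad
  \mathcal{R}(m^{\star},n) \propto n^{-\frac{2s\beta}{2s\beta+1}}
  = n^{-\frac{2s}{2s+1/\beta}} .
\]
```

Larger beta, i.e. a steeper spectrum and a smaller redundancy measure 1/beta, pushes alpha toward 1, consistent with the claim that steeper spectra accelerate returns to scale.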

Abstract

Scaling laws, a defining feature of deep learning, reveal a striking power-law improvement in model performance with increasing dataset and model size. Yet, their mathematical origins, especially the scaling exponent, have remained elusive. In this work, we show that scaling laws can be formally explained as redundancy laws. Using kernel regression, we show that a polynomial tail in the data covariance spectrum yields an excess risk power law with exponent alpha = 2s / (2s + 1/beta), where beta controls the spectral tail and 1/beta measures redundancy. This reveals that the learning curve's slope is not universal but depends on data redundancy, with steeper spectra accelerating returns to scale. We establish the law's universality across boundedly invertible transformations, multi-modal mixtures, finite-width approximations, and Transformer architectures in both linearized (NTK) and feature-learning regimes. This work delivers the first rigorous mathematical explanation of scaling laws as finite-sample redundancy laws, unifying empirical observations with theoretical foundations.
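
As a quick plausibility check, one can minimize the same heuristic tradeoff numerically over the cutoff and fit the log-log slope of the resulting learning curve; for s = 1 and beta = 2 the fitted slope should land near the predicted alpha = 0.8. This is a minimal sketch under the assumptions above, not the paper's analysis or experiments; the function name excess_risk and the parameter values are illustrative only.

```python
# Heuristic numerical check of the claimed exponent alpha = 2s / (2s + 1/beta).
# Not the paper's derivation: it minimizes the sketch tradeoff
#   estimation ~ m/n   +   approximation ~ m^(-2*s*beta)
# over the spectral cutoff m, assuming covariance eigenvalues lambda_k ~ k^(-beta)
# and a source condition of order s (both assumed conventions).
import numpy as np

def excess_risk(n, s=1.0, beta=2.0):
    """Minimize the heuristic risk m/n + m**(-2*s*beta) over integer cutoffs m."""
    m = np.arange(1, 100_000, dtype=float)
    return np.min(m / n + m ** (-2.0 * s * beta))

s, beta = 1.0, 2.0
ns = np.logspace(3, 7, 20)                      # sweep of "dataset sizes" n
risks = np.array([excess_risk(n, s, beta) for n in ns])

# Fit the log-log slope of the learning curve and compare to the predicted exponent.
slope = -np.polyfit(np.log(ns), np.log(risks), 1)[0]
alpha = 2 * s / (2 * s + 1 / beta)
print(f"fitted slope ~ {slope:.3f}   predicted alpha = {alpha:.3f}")
```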