Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency

arXiv cs.LG / 5/6/2026


Key Points

  • The paper proposes a finite-size gradient-transport framework for large language model pretraining, using five observables (D, z, β, δ, v_rel) to disentangle cascade size, training duration, absolute transport, and intensive transport efficiency.
  • Using raw gradient measurements from Pico-LM across four scales and 125 aligned steps, together with a five-scale companion dataset from Pythia built from 153 aligned checkpoint-difference update fields (a construction sketched just after this list), the authors find that the same algebraic closure holds in both model families.
  • Despite this shared mathematical structure and a near-unity “cascade-size backbone,” the two families fall into different transport regimes: Pico-LM shows positive duration scaling and negative intensive-efficiency scaling, while Pythia stays near the D=1 baseline with only weak positive efficiency scaling.
  • Control experiments with randomized-field baselines produce nearly matched null floors in the intensive and duration channels, suggesting the observed differences reflect real deviations from a shared null structure rather than calibration artifacts.
  • The work identifies channel-level links to external performance (mainly via v_rel and normalized cascade duration) and argues that D(t) serves as a shared size backbone without strong exponent-level performance association, presenting the framework as reusable rather than claiming a universal fixed point or first-principles neural scaling law derivation.
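The Pythia channel rests on parameter differences between aligned checkpoints rather than raw gradients. Below is a minimal sketch of that construction, assuming a standard PyTorch state-dict layout; the loader and step grid are hypothetical placeholders, and the paper's exact alignment and flattening choices may differ.

```python
# Sketch: building "checkpoint-difference update fields" from pairs of
# aligned checkpoints, u_t = theta_{t+1} - theta_t. The file names and the
# step grid below are hypothetical, not the paper's pipeline.
import numpy as np
import torch

def update_field(state_a: dict, state_b: dict) -> np.ndarray:
    """Flatten the parameter difference between two aligned checkpoints
    into one vector over all weights."""
    diffs = [
        (state_b[name].float() - tensor.float()).flatten()
        for name, tensor in state_a.items()
    ]
    return torch.cat(diffs).cpu().numpy()

# Hypothetical usage over an aligned step grid (the Pythia companion
# dataset comprises 153 such difference fields per scale):
#   states = [torch.load(f"step{s}.pt") for s in step_grid]
#   fields = [update_field(a, b) for a, b in zip(states, states[1:])]
```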

Abstract

We introduce a finite-size gradient-transport framework for real language-model training, based on five observables (D, z, β, δ, v_rel) that separate cascade size, duration, absolute transport, and intensive transport efficiency. We analyze direct raw-gradient measurements from Pico-LM across four scales and 125 aligned steps, together with a five-scale Pythia companion dataset built from 153 aligned checkpoint-difference update fields. The same algebraic closure holds in both families, and both share a near-unity cascade-size backbone, but they occupy distinct transport regimes: Pico-LM shows positive duration scaling and negative intensive-efficiency scaling, whereas Pythia remains near the D=1 baseline with only weak positive efficiency scale dependence. Randomized-field controls give nearly matched null floors in the intensive and duration channels, indicating that the contrast reflects different real departures from a shared null skeleton rather than different null calibrations. The families also differ in stepwise power-law compressibility: Pico-LM retains clean duration and efficiency power laws, whereas Pythia preserves the size backbone but shows weaker one-slope compressibility in those channels. External performance associations are correspondingly channel-level, carried mainly by v_rel and normalized cascade duration, while D(t) acts as a shared size backbone without a significant exponent-level performance association. These results support a reusable transport measurement framework without claiming a universal fixed point or a first-principles derivation of neural scaling laws.
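The abstract's randomized-field controls admit a simple reading, which is an assumption on my part rather than the paper's stated estimator: permute each update field entry-wise to destroy structure while preserving its marginal distribution, then recompute the channel statistic on the shuffled fields. A minimal sketch:

```python
# Sketch of a randomized-field null floor. `observable` stands in for any of
# the paper's channel statistics (e.g., the intensive or duration channel);
# the entry-wise permutation null here is an illustrative assumption.
from typing import Callable
import numpy as np

def null_floor(
    fields: list[np.ndarray],
    observable: Callable[[list[np.ndarray]], float],
    n_shuffles: int = 20,
    seed: int = 0,
) -> tuple[float, float]:
    """Permute each field's entries to destroy structure while keeping its
    marginal distribution, then recompute the observable. The mean and
    spread of these null values form the floor against which the measured
    channel is compared."""
    rng = np.random.default_rng(seed)
    nulls = [
        observable([rng.permutation(f) for f in fields])
        for _ in range(n_shuffles)
    ]
    return float(np.mean(nulls)), float(np.std(nulls))
```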
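"One-slope power-law compressibility" can be checked by regressing a channel's log value on log scale (or log step) and inspecting the residual: if a single slope captures the trend, the residual is small. A hedged sketch, with synthetic inputs standing in for the paper's measurements:

```python
# Sketch: testing whether a transport channel is compressible into a single
# power law y ~ x^alpha. The data below are synthetic placeholders.
import numpy as np

def power_law_fit(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Fit log y = alpha * log x + c; return (alpha, RMS log residual).
    A small residual means the channel is well described by one slope."""
    alpha, c = np.polyfit(np.log(x), np.log(y), 1)
    resid = np.log(y) - (alpha * np.log(x) + c)
    return float(alpha), float(np.sqrt(np.mean(resid ** 2)))

# Synthetic example: a clean power law recovers its exponent with near-zero
# residual, mimicking Pico-LM's clean duration and efficiency channels;
# a large residual would mirror Pythia's weaker compressibility.
x = np.array([1e7, 7e7, 1.6e8, 4.1e8])   # e.g., parameter counts
y = 3.0 * x ** -0.25                      # clean power law
print(power_law_fit(x, y))                # -> (-0.25, ~0.0)
```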