Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency

arXiv cs.LG / 5/6/2026


Key Points

  • The paper proposes a finite-size gradient-transport framework for large language model pretraining, using five observables (D, z, β, δ, v_rel) to disentangle cascade size, training duration, absolute transport, and intensive transport efficiency.
  • Using raw gradient measurements from Pico-LM across four scales and 125 aligned steps, together with a five-scale companion dataset from Pythia built from 153 aligned checkpoint-difference update fields (a construction sketched just after this list), the authors find that the same algebraic closure holds in both model families.
  • Despite this shared mathematical structure and a near-unity “cascade-size backbone,” the two families fall into different transport regimes: Pico-LM shows positive duration scaling and negative intensive-efficiency scaling, while Pythia stays near the D=1 baseline with only weak positive efficiency scaling.
  • Control experiments with randomized-field baselines produce nearly matched null floors in the intensive and duration channels, suggesting the observed differences reflect real deviations from a shared null structure rather than calibration artifacts.
  • The work identifies channel-level links to external performance (mainly via v_rel and normalized cascade duration) and argues that D(t) serves as a shared size backbone without strong exponent-level performance association, presenting the framework as reusable rather than claiming a universal fixed point or first-principles neural scaling law derivation.
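The Pythia channel rests on parameter differences between aligned checkpoints rather than raw gradients. Below is a minimal sketch of that construction, assuming a standard PyTorch state-dict layout; the loader and step grid are hypothetical placeholders, and the paper's exact alignment and flattening choices may differ.

```python
# Sketch: building "checkpoint-difference update fields" from pairs of
# aligned checkpoints, u_t = theta_{t+1} - theta_t. The file names and the
# step grid below are hypothetical, not the paper's pipeline.
import numpy as np
import torch

def update_field(state_a: dict, state_b: dict) -> np.ndarray:
    """Flatten the parameter difference between two aligned checkpoints
    into one vector over all weights."""
    diffs = [
        (state_b[name].float() - tensor.float()).flatten()
        for name, tensor in state_a.items()
    ]
    return torch.cat(diffs).cpu().numpy()

# Hypothetical usage over an aligned step grid (the Pythia companion
# dataset comprises 153 such difference fields per scale):
#   states = [torch.load(f"step{s}.pt") for s in step_grid]
#   fields = [update_field(a, b) for a, b in zip(states, states[1:])]
```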

Abstract

We introduce a finite-size gradient-transport framework for real language-model training, based on five observables (D, z, β, δ, v_rel) that separate cascade size, duration, absolute transport, and intensive transport efficiency. We analyze direct raw-gradient measurements from Pico-LM across four scales and 125 aligned steps, together with a five-scale Pythia companion dataset built from 153 aligned checkpoint-difference update fields. The same algebraic closure holds in both families, and both share a near-unity cascade-size backbone, but they occupy distinct transport regimes: Pico-LM shows positive duration scaling and negative intensive-efficiency scaling, whereas Pythia remains near the D=1 baseline with only weak positive efficiency scale dependence. Randomized-field controls give nearly matched null floors in the intensive and duration channels, indicating that the contrast reflects different real departures from a shared null skeleton rather than different null calibrations. The families also differ in stepwise power-law compressibility: Pico-LM retains clean duration and efficiency power laws, whereas Pythia preserves the size backbone but shows weaker one-slope compressibility in those channels. External performance associations are correspondingly channel-level, carried mainly by v_rel and normalized cascade duration, while D(t) acts as a shared size backbone without a significant exponent-level performance association. These results support a reusable transport measurement framework without claiming a universal fixed point or a first-principles derivation of neural scaling laws.
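The abstract's randomized-field controls admit a simple reading, which is an assumption on my part rather than the paper's stated estimator: permute each update field entry-wise to destroy structure while preserving its marginal distribution, then recompute the channel statistic on the shuffled fields. A minimal sketch:

```python
# Sketch of a randomized-field null floor. `observable` stands in for any of
# the paper's channel statistics (e.g., the intensive or duration channel);
# the entry-wise permutation null here is an illustrative assumption.
from typing import Callable
import numpy as np

def null_floor(
    fields: list[np.ndarray],
    observable: Callable[[list[np.ndarray]], float],
    n_shuffles: int = 20,
    seed: int = 0,
) -> tuple[float, float]:
    """Permute each field's entries to destroy structure while keeping its
    marginal distribution, then recompute the observable. The mean and
    spread of these null values form the floor against which the measured
    channel is compared."""
    rng = np.random.default_rng(seed)
    nulls = [
        observable([rng.permutation(f) for f in fields])
        for _ in range(n_shuffles)
    ]
    return float(np.mean(nulls)), float(np.std(nulls))
```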
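"One-slope power-law compressibility" can be checked by regressing a channel's log value on log scale (or log step) and inspecting the residual: if a single slope captures the trend, the residual is small. A hedged sketch, with synthetic inputs standing in for the paper's measurements:

```python
# Sketch: testing whether a transport channel is compressible into a single
# power law y ~ x^alpha. The data below are synthetic placeholders.
import numpy as np

def power_law_fit(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Fit log y = alpha * log x + c; return (alpha, RMS log residual).
    A small residual means the channel is well described by one slope."""
    alpha, c = np.polyfit(np.log(x), np.log(y), 1)
    resid = np.log(y) - (alpha * np.log(x) + c)
    return float(alpha), float(np.sqrt(np.mean(resid ** 2)))

# Synthetic example: a clean power law recovers its exponent with near-zero
# residual, mimicking Pico-LM's clean duration and efficiency channels;
# a large residual would mirror Pythia's weaker compressibility.
x = np.array([1e7, 7e7, 1.6e8, 4.1e8])   # e.g., parameter counts
y = 3.0 * x ** -0.25                      # clean power law
print(power_law_fit(x, y))                # -> (-0.25, ~0.0)
```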