Time is Not Compute: Scaling Laws for Wall-Clock Constrained Training on Consumer GPUs

arXiv cs.AI / 4/1/2026


Key Points

  • The paper studies how to choose optimal model size when training is constrained by wall-clock time (5 minutes to 24 hours) rather than FLOPs, using consumer GPUs (RTX 4090) across 70+ runs from 50M to 1031M parameters.
  • It finds a U-shaped performance curve at each fixed time budget: models that are too small overfit while models that are too large undertrain, indicating an intermediate “sweet spot” per time budget.
  • The authors derive that the optimal parameter count scales as N* ∝ t^0.60, which grows faster than prior compute-based scaling (e.g., Chinchilla’s N* ∝ C^0.50), with α = 0.60 ± 0.07 remaining robust across sensitivity analyses.
  • The paper proposes a dual-mechanism explanation for the observed behavior: short-budget U-curves come from compute bottlenecks, while long-budget U-curves come from data bottlenecks/overfitting, with a mid regime where the U-curve can temporarily vanish.
  • All code, logs, and experimental configurations are released, aiming to directly inform practitioners training on consumer hardware where elapsed time is the binding constraint.
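To make the reported scaling relation concrete, the exponent N* ∝ t^0.60 can be used to extrapolate optimal model size between time budgets. This is an illustrative sketch only: the anchor point (150M parameters as optimal at a 1-hour budget) is a hypothetical calibration value, not a number from the paper.

```python
def optimal_params(t_hours, alpha=0.60, t0_hours=1.0, n0_params=150e6):
    """Extrapolate optimal parameter count N* from a known anchor (t0, N0),
    assuming the power law N* = N0 * (t / t0)**alpha from the paper."""
    return n0_params * (t_hours / t0_hours) ** alpha

# Doubling the time budget multiplies N* by 2**0.60 ≈ 1.52,
# versus 2**0.50 ≈ 1.41 under Chinchilla's compute-optimal exponent.
ratio = optimal_params(2.0) / optimal_params(1.0)
```

The practical takeaway is that under wall-clock constraints, extra training time should be spent on a larger model somewhat more aggressively than a compute-based rule like Chinchilla's would suggest.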

Abstract

Scaling laws relate model quality to compute budget (FLOPs), but practitioners face wall-clock time constraints, not compute budgets. We study optimal model sizing under fixed time budgets from 5 minutes to 24 hours on consumer GPUs (RTX 4090). Across 70+ runs spanning 50M to 1031M parameters, we find: (1) at each time budget a U-shaped curve emerges where too-small models overfit and too-large models undertrain; (2) optimal model size follows N* ∝ t^0.60, growing *faster* than Chinchilla's N* ∝ C^0.50, with α = 0.60 ± 0.07 robustly exceeding compute-optimal across all sensitivity analyses; (3) a *dual U-shape mechanism*: short-budget U-curves arise from compute bottlenecks, while long-budget U-curves emerge from data bottlenecks (overfitting), with an intermediate regime where the U-curve temporarily disappears. These findings have immediate implications for researchers training on consumer hardware, where wall-clock time, not FLOPs, is the binding constraint. We release all code, logs, and 70+ experimental configurations.
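The exponent α is the kind of quantity one would typically recover by fitting a line in log-log space to the observed (time budget, optimal size) pairs. The sketch below shows that standard ordinary-least-squares approach on synthetic data generated with α = 0.60; the data points are fabricated for illustration and are not the paper's measurements.

```python
import math

def fit_alpha(times, sizes):
    """Ordinary least-squares slope in log-log space: log N* = alpha * log t + c."""
    xs = [math.log(t) for t in times]
    ys = [math.log(n) for n in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic (t, N*) pairs generated with alpha = 0.60 by construction;
# the fit recovers the exponent exactly (real data would add noise).
times = [5 / 60, 0.5, 1, 4, 12, 24]      # time budgets in hours
sizes = [1e8 * t ** 0.60 for t in times]  # N* ∝ t^0.60
print(round(fit_alpha(times, sizes), 2))  # → 0.6
```

On real runs the residuals around this fit would also yield the uncertainty estimate (here ± 0.07) the authors report.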