Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

arXiv cs.LG / 4/13/2026


Key Points

  • The paper studies how language-model quality changes with dataset size under compute- and architecture-restricted conditions, using a simplified attention-only decoder.
  • Experiments on progressively larger (power-of-two) data subsets show smooth gains that follow scaling-law-like behavior, with clear diminishing returns.
  • The authors report that using roughly 30% of the training data can achieve about 90% of the full-data validation token-level accuracy.
  • Results are framed as practical guidance for deciding how much data to collect and train on when resources are limited, such as in small labs or exploratory development.
  • By isolating dataset-size effects in a component-restricted model, the work aims to clarify scaling-law implications beyond large-scale settings.
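The "30% of data for 90% of accuracy" finding above amounts to a simple threshold search over subset results. A minimal sketch (hypothetical helper, not the paper's code): given validation accuracies measured on power-of-two data fractions, find the smallest fraction that reaches a target share of the full-data accuracy.

```python
def smallest_sufficient_fraction(fractions, accuracies, target_ratio=0.9):
    """Return the smallest data fraction whose validation accuracy reaches
    target_ratio * (accuracy at the full dataset, fraction == 1.0)."""
    full_acc = accuracies[fractions.index(1.0)]
    for frac, acc in sorted(zip(fractions, accuracies)):
        if acc >= target_ratio * full_acc:
            return frac
    return 1.0

# Power-of-two fractions with made-up accuracies that saturate
# (diminishing returns), qualitatively like the curves described above.
fractions = [1 / 32, 1 / 16, 1 / 8, 1 / 4, 1 / 2, 1.0]
accuracies = [0.30, 0.41, 0.50, 0.56, 0.60, 0.62]

print(smallest_sufficient_fraction(fractions, accuracies))  # → 0.25
```

With these illustrative numbers, a quarter of the data already clears 90% of the full-data accuracy (0.56 ≥ 0.9 × 0.62), mirroring the paper's reported trade-off.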

Abstract

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.
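The "smooth improvements with diminishing returns" the abstract describes are typically summarized by fitting a saturating power law. A minimal sketch under an assumed functional form (not taken from the paper): model accuracy as acc(n) = a_inf − b · n^(−alpha) and, when the asymptote a_inf is known, recover b and alpha by linear regression on log(a_inf − acc) versus log(n).

```python
import math

def fit_power_law(ns, accs, a_inf):
    """Fit acc(n) = a_inf - b * n**(-alpha) by linearizing:
    log(a_inf - acc) = log(b) - alpha * log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(a_inf - acc) for acc in accs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = math.exp(my - slope * mx)
    return -slope, b  # (alpha, b)

# Synthetic accuracies at power-of-two dataset sizes, generated from the
# assumed law with a_inf = 0.65, b = 0.9, alpha = 0.5 (illustrative values).
ns = [2 ** k for k in range(10, 16)]
accs = [0.65 - 0.9 * n ** -0.5 for n in ns]
alpha, b = fit_power_law(ns, accs, a_inf=0.65)
```

On noiseless synthetic data the fit recovers alpha = 0.5 and b = 0.9 exactly; with real measurements one would instead use nonlinear least squares and also fit a_inf.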