Data-efficient pre-training by scaling synthetic megadocs

arXiv cs.LG / 3/20/2026

Key Points

The work studies synthetic data augmentation to improve loss scaling in pre-training, focusing on benefits that grow as compute increases in data-constrained settings.
It shows that mixing web data with synthetically generated rephrases improves i.i.d. validation loss on web data, even though the synthetic data originate from a different distribution.
With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, with data efficiency plateauing near 1.48x at 32 rephrases per document.
The authors introduce megadocs, which combine synthetic generations from the same document into a single, substantially longer document, either by stitching rephrases together or by stretching the document with inserted rationales.
Megadocs outperform simple rephrasing in i.i.d. loss, downstream benchmarks, and especially long-context loss, raising data efficiency from 1.48x to about 1.80x at 32 generations per document; the advantage over simple rephrasing widens as more synthetic data is generated.
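To make the difference between simple rephrasing and stitched megadocs concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `rephrases_by_doc` structure, and the blank-line separator are assumptions made for illustration, and the summary does not say how rephrases are ordered or joined, or whether the original document is included.

```python
from typing import Dict, List

# Assumption: the source does not specify how rephrases are joined.
SEPARATOR = "\n\n"


def simple_rephrase_docs(rephrases_by_doc: Dict[str, List[str]]) -> List[str]:
    """Baseline: every rephrase becomes its own short training document."""
    return [r for rephrases in rephrases_by_doc.values() for r in rephrases]


def stitch_megadocs(
    rephrases_by_doc: Dict[str, List[str]], separator: str = SEPARATOR
) -> List[str]:
    """Stitching: join all rephrases of one source document into one long document."""
    return [
        separator.join(rephrases)
        for rephrases in rephrases_by_doc.values()
        if rephrases  # skip documents with no generations
    ]


# Hypothetical toy input: two source documents with 3 and 2 rephrases.
rephrases_by_doc = {
    "doc-a": ["A1", "A2", "A3"],
    "doc-b": ["B1", "B2"],
}

short_docs = simple_rephrase_docs(rephrases_by_doc)  # 5 short documents
megadocs = stitch_megadocs(rephrases_by_doc)         # 2 long documents
```

Both variants contain the same synthetic text; only the document boundaries differ, which is the axis along which the work compares rephrasing with megadocs.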

Abstract

Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.
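The reported multipliers can be put side by side with a little arithmetic. The sketch below assumes that a data efficiency of k means the method matches a baseline trained on k times as much real data. That reading is an assumption, since the abstract does not define the term, and the corpus size is a hypothetical placeholder rather than a figure from the paper.

```python
# Data-efficiency multipliers quoted in the abstract (32 generations per document).
REPHRASE_EFFICIENCY = 1.48
MEGADOC_EFFICIENCY = 1.80


def effective_tokens(real_tokens: float, data_efficiency: float) -> float:
    """Real-data equivalent under the ASSUMED reading of data efficiency:
    a multiplier k means the method matches a baseline with k times more real data."""
    return real_tokens * data_efficiency


# Hypothetical corpus size, chosen only for illustration.
real_tokens = 1_000_000_000

rephrase_equiv = effective_tokens(real_tokens, REPHRASE_EFFICIENCY)
megadoc_equiv = effective_tokens(real_tokens, MEGADOC_EFFICIENCY)

# Relative gain of megadocs over simple rephrasing: 1.80 / 1.48, about 1.22.
relative_gain = MEGADOC_EFFICIENCY / REPHRASE_EFFICIENCY

print(f"rephrasing ~ {rephrase_equiv:.3g} tokens, megadocs ~ {megadoc_equiv:.3g} tokens")
print(f"megadocs vs rephrasing: {relative_gain:.3f}x")
```

Under this reading, moving from simple rephrasing to megadocs is worth roughly 22% more real data at 32 generations per document.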
