Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
arXiv cs.LG / 3/26/2026
Key Points
- The paper argues that simply scaling existing synthetic data augmentation (more synthetic tokens or stronger generators) shows diminishing returns and stays below retrieval-augmented generation (RAG) performance.
- It introduces “Synthetic Mixed Training,” combining synthetic question-answer pairs with synthetic documents to provide complementary learning signals and achieve log-linear improvements as synthetic volume and generator strength increase.
- The approach reports outperforming RAG on the QuALITY long-document reading comprehension benchmark, achieving a 2.6% relative gain with an intermediate recipe and a 4.4% relative gain with the final recipe, which trains a Llama 8B model.
- A second contribution, “Focal Rewriting,” conditions synthetic document generation on specific questions to improve document diversity and produce a steeper scaling curve.
- Across multiple benchmarks (QuALITY, LongHealth, FinanceBench), the authors find that trained models beat RAG in five of six settings, and they report a further 9.1% gain when Synthetic Mixed Training is combined with RAG.
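To make the two techniques concrete, here is a minimal sketch of how a "Synthetic Mixed Training" corpus might be assembled: synthetic QA pairs are interleaved with synthetic documents, and a "Focal Rewriting" step conditions each rewritten document on a specific question. All function names are illustrative assumptions, not the paper's actual API, and `focal_rewrite` stands in for what would be an LLM call.

```python
import random

def focal_rewrite(document: str, question: str) -> str:
    """Placeholder for an LLM call that rewrites `document` with
    `question` in the prompt, steering content toward that question
    (the paper's "Focal Rewriting" idea, sketched hypothetically)."""
    return f"[rewrite focused on: {question}]\n{document}"

def build_mixed_corpus(documents, qa_pairs, seed=0):
    """Interleave synthetic QA examples with question-conditioned
    document rewrites, the two complementary signals mixed in training."""
    examples = []
    for doc, (q, a) in zip(documents, qa_pairs):
        examples.append({"type": "qa", "text": f"Q: {q}\nA: {a}"})
        examples.append({"type": "doc", "text": focal_rewrite(doc, q)})
    random.Random(seed).shuffle(examples)  # mix the two signal types
    return examples

corpus = build_mixed_corpus(
    documents=["The Nile is a river in northeastern Africa."],
    qa_pairs=[("Where is the Nile?", "Northeastern Africa")],
)
print(len(corpus))  # → 2 (one QA example, one focused rewrite)
```

The key design point the paper argues for is that neither signal alone scales well: QA pairs teach answer formats while documents carry broad content, and mixing them is what yields the reported log-linear scaling.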