Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG

arXiv cs.LG / 2026/3/26


Key Points

  • The paper argues that simply scaling existing synthetic data augmentation (more synthetic tokens or stronger generators) shows diminishing returns and stays below retrieval-augmented generation (RAG) performance.
  • It introduces “Synthetic Mixed Training,” combining synthetic question-answer pairs with synthetic documents to provide complementary learning signals and achieve log-linear improvements as synthetic volume and generator strength increase.
  • The approach reports outperforming RAG on the QuALITY long-document reading comprehension benchmark, with a 2.6% relative gain from an intermediate recipe and a 4.4% relative gain from the final recipe, which trains a Llama 8B model.
  • A second contribution, “Focal Rewriting,” conditions synthetic document generation on specific questions to improve document diversity and produce a steeper scaling curve.
  • Across multiple benchmarks (QuALITY, LongHealth, FinanceBench), the trained models beat RAG in five of six settings, and the authors report a 9.1% gain when Synthetic Mixed Training is combined with RAG.
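The core recipe described in these bullets is a data-mixing step: synthetic QA pairs and synthetic documents are combined into one training corpus. A minimal sketch of how such a corpus could be assembled (the QA formatting template, function names, and uniform shuffling here are illustrative assumptions, not the paper's actual implementation):

```python
import random

def build_mixed_corpus(qa_pairs, documents, seed=0):
    """Combine synthetic QA pairs and synthetic documents into a single
    shuffled training corpus (hypothetical sketch of Synthetic Mixed
    Training's data step; formatting and mixing policy are assumed)."""
    # Render each (question, answer) pair as a plain-text training example.
    qa_examples = [f"Question: {q}\nAnswer: {a}" for q, a in qa_pairs]
    # Synthetic documents are used directly as continued-pretraining text.
    corpus = qa_examples + list(documents)
    # Deterministic shuffle so the two signal types are interleaved.
    random.Random(seed).shuffle(corpus)
    return corpus
```

The intuition from the paper is that the two example types carry complementary signals: QA pairs teach answer extraction, while documents inject the underlying knowledge, so training on the mixture scales better than scaling either source alone.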

Abstract

Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns that remain below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals and enables log-linear improvements as both synthetic data volume and generator strength increase, allowing the model to outperform RAG by a 2.6% relative gain on QuALITY, a long-document reading comprehension benchmark. In addition, we introduce Focal Rewriting, a simple technique for synthetic document generation that explicitly conditions document generation on specific questions, improving the diversity of synthetic documents and yielding a steeper log-linear scaling curve. On QuALITY, our final recipe trains a Llama 8B model that outperforms RAG by 4.4% relatively. Across models and benchmarks (QuALITY, LongHealth, FinanceBench), our training beats RAG in five of six settings, outperforms it by 2.6%, and achieves a 9.1% gain when combined with RAG.
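The Focal Rewriting idea from the abstract, conditioning synthetic document generation on a specific question, can be illustrated as a prompt-construction step. The prompt wording and function name below are assumptions for illustration; the paper's actual template is not reproduced here:

```python
def focal_rewrite_prompt(source_doc: str, focal_question: str) -> str:
    """Build a generator prompt that rewrites a source document while
    conditioning on one focal question (hypothetical sketch; the exact
    instruction text used in the paper is assumed, not quoted)."""
    return (
        "Rewrite the following document so that it clearly contains the "
        "information needed to answer this question: "
        f"{focal_question}\n\n"
        f"Document:\n{source_doc}"
    )
```

Each question from the synthetic QA set would yield a differently conditioned rewrite of the same source, which is how this scheme could increase the diversity of the synthetic documents.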