Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
arXiv cs.LG / 2026/3/26
Key points
- The paper argues that simply scaling existing synthetic data augmentation (more synthetic tokens or stronger generators) shows diminishing returns and stays below retrieval-augmented generation (RAG) performance.
- It introduces “Synthetic Mixed Training,” combining synthetic question-answer pairs with synthetic documents to provide complementary learning signals and achieve log-linear improvements as synthetic volume and generator strength increase.
- The authors report outperforming RAG on the QuALITY long-document reading comprehension benchmark, with a 2.6% relative gain from an intermediate recipe and a 4.4% relative gain from the final recipe, which trains a Llama 8B model.
- A second contribution, “Focal Rewriting,” conditions synthetic document generation on specific questions to improve document diversity and produce a steeper scaling curve.
- Across multiple benchmarks (QuALITY, LongHealth, FinanceBench), the authors report that their models beat RAG in five of six settings, and that combining Synthetic Mixed Training with RAG yields a 9.1% gain.
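
To make the core idea concrete, here is a minimal sketch of what mixing the two kinds of synthetic data might look like when assembling a fine-tuning corpus. It is an illustration only, not the authors' pipeline: the function name `build_mixed_corpus`, the `Question:`/`Answer:` prompt format, the `kind` field, and the uniform shuffle are all assumptions, and the summary does not state the mixing ratio or formatting the paper actually uses.

```python
import random


def build_mixed_corpus(qa_pairs, documents, seed=0):
    """Combine synthetic QA pairs and synthetic documents into one shuffled list.

    Hypothetical helper: the prompt format and uniform shuffling are
    assumptions for illustration, not details taken from the paper.
    """
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    examples = []

    # Synthetic question-answer pairs become short supervised examples.
    for question, answer in qa_pairs:
        examples.append(
            {"kind": "qa", "text": f"Question: {question}\nAnswer: {answer}"}
        )

    # Synthetic documents (e.g. rewrites of the source text) are kept as plain text.
    for doc in documents:
        examples.append({"kind": "document", "text": doc})

    # Interleave both signal types so each training batch can see a mix.
    rng.shuffle(examples)
    return examples
```

With two QA pairs and three documents, the function returns five examples in a seed-dependent order. A real recipe would likely also control the ratio between the two sources, which this sketch leaves at whatever the input sizes happen to be.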



