Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

arXiv cs.CL / 3/27/2026


Key Points

  • The paper studies how synthetic rewriting interacts with the quality of the original source text during continued pretraining, shifting the focus from the English-only settings of most prior work to Portuguese.
  • From the quality-annotated ClassiCC-PT corpus, the authors construct two 10B-token subsets at different quality levels and rewrite each into four styles with a 7B instruction-tuned model, producing roughly 40B synthetic tokens per condition.
  • Evaluation on PoETa V2 (44 Portuguese tasks) shows a strong scale-dependent effect: for 7B base models, rewriting high-quality data improves results (+3.4 NPM), while rewriting low-quality data helps much less (+0.5 NPM); a sketch of the NPM computation follows this list.
  • For smaller 1.1B models, the quality–rewriting interaction is weaker, with unmodified low-quality data performing similarly to rewritten high-quality data.
  • Overall, the study concludes that synthetic rewriting functions mainly as a "quality multiplier" rather than a substitute for data curation, and that its benefits depend on model scale.
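
For readers unfamiliar with the metric, the sketch below shows one common way NPM (Normalized Preferred Metric) aggregates are computed: each task's preferred metric is rescaled so that random guessing maps to 0 and a perfect score maps to 100, then the rescaled scores are averaged. This is an assumption about the convention used here, not a definition taken from the paper, and the task names and baselines are illustrative.

```python
# Hypothetical NPM aggregate: rescale each task so random = 0 and
# perfect = 100, then average across tasks. All numbers are made up.

def npm(scores: dict[str, float],
        baselines: dict[str, float],
        max_score: float = 1.0) -> float:
    """Average normalized score across tasks (0 = random, 100 = perfect)."""
    normalized = [
        100.0 * (scores[t] - baselines[t]) / (max_score - baselines[t])
        for t in scores
    ]
    return sum(normalized) / len(normalized)

# Two hypothetical multiple-choice tasks with different chance levels.
scores = {"task_a": 0.62, "task_b": 0.55}     # model accuracies
baselines = {"task_a": 0.25, "task_b": 0.50}  # random-guess accuracies
print(f"NPM: {npm(scores, baselines):.1f}")   # -> NPM: 29.7
```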

Abstract

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
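
To make the rewriting setup concrete, here is a minimal sketch of the four-style rewriting loop, assuming a Hugging Face text-generation pipeline. The checkpoint, style names, and prompt wording are illustrative stand-ins for "a 7B instruction-tuned model" and the paper's four styles, which are not specified here.

```python
# Sketch of a document-rewriting pass; model, styles, and prompt are assumptions.
from transformers import pipeline

STYLES = ["textbook", "q_and_a", "summary", "wikipedia"]  # hypothetical style set

rewriter = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # stand-in for the paper's 7B instruct model
)

def rewrite(document: str, style: str) -> str:
    # Portuguese prompt, since the source corpus is Portuguese.
    prompt = (
        f"Reescreva o texto a seguir no estilo '{style}', "
        f"preservando todas as informações factuais.\n\n{document}\n\nReescrita:"
    )
    out = rewriter(prompt, max_new_tokens=1024, do_sample=True, temperature=0.7)
    # The pipeline echoes the prompt by default; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()

source_document = "Texto de exemplo do corpus."  # placeholder document

# Rewriting each 10B-token subset into all four styles is what yields
# the ~40B synthetic tokens per condition described above.
synthetic_versions = {style: rewrite(source_document, style) for style in STYLES}
```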