Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining
arXiv cs.CL / 3/27/2026
Key Points
- The paper studies how synthetic rewriting interacts with the quality of the original source text during continued pretraining, focusing on Portuguese rather than the English-only setting of most prior experiments.
- Using ClassiCC-PT quality-scored subsets (10B tokens at different quality levels) rewritten into four styles by a 7B instruction-tuned model, the authors generate roughly 40B synthetic tokens per condition for training (a minimal sketch of this rewriting step follows the list).
- Evaluation on PoETa V2 (44 Portuguese tasks) shows a strong scale-dependent effect: for 7B base models, rewriting high-quality data improves results (+3.4 NPM), while rewriting low-quality data helps much less (+0.5 NPM); a short note on the NPM metric also follows the list.
- For smaller 1.1B models, the quality–rewriting interaction is weaker, with unmodified low-quality data performing similarly to rewritten high-quality data.
- Overall, the study concludes that synthetic rewriting functions mainly as a “quality multiplier” rather than replacing the need for data curation, and that benefits depend on model scale.
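For readers who want a concrete picture of the rewriting step, below is a minimal Python sketch of how a quality-filtered document might be rewritten into several styles with an instruction-tuned model. The model name, the four style prompts, and the generation settings are illustrative placeholders, not the paper's actual configuration.

```python
# Minimal sketch of the rewriting step (assumed workflow, not the paper's exact code).
# MODEL_NAME and the style prompts below are hypothetical placeholders.
from transformers import pipeline

MODEL_NAME = "some-7b-instruct-model"  # placeholder for the 7B instruction-tuned rewriter

# Four illustrative rewriting styles; the paper's actual styles and prompts may differ,
# and would presumably be phrased in Portuguese for a Portuguese corpus.
STYLE_PROMPTS = {
    "encyclopedia": "Rewrite the following text in the style of an encyclopedia article:",
    "qa": "Rewrite the following text as a series of questions and answers:",
    "plain": "Rewrite the following text in plain, easy-to-read language:",
    "textbook": "Rewrite the following text in the style of a textbook chapter:",
}

rewriter = pipeline("text-generation", model=MODEL_NAME, device_map="auto")

def rewrite(document: str, style: str) -> str:
    """Generate one synthetic rewrite of `document` in the requested style."""
    prompt = f"{STYLE_PROMPTS[style]}\n\n{document}\n"
    out = rewriter(
        prompt,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,  # keep only the newly generated continuation
    )
    return out[0]["generated_text"].strip()

# Each source document yields one synthetic document per style, which is how a
# quality-scored subset expands into a much larger synthetic corpus.
example_doc = "..."  # a single quality-scored ClassiCC-PT document would go here
synthetic_versions = {style: rewrite(example_doc, style) for style in STYLE_PROMPTS}
```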
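On the metric: NPM is commonly read as the Normalized Preferred Metric, which rescales each task's score so that random guessing maps to 0 and a perfect score to 100, then averages across tasks. Assuming that reading, the toy computation below shows how a headline figure like +3.4 NPM would be aggregated over the 44 PoETa V2 tasks; the task names, scores, and baselines are made up for illustration.

```python
# Toy illustration of the assumed NPM (Normalized Preferred Metric) computation.
def npm(task_scores: dict[str, float], random_baselines: dict[str, float]) -> float:
    """Rescale each task so its random-guess baseline maps to 0 and a perfect
    score maps to 100, then average the rescaled scores across tasks."""
    rescaled = [
        100.0 * (task_scores[t] - random_baselines[t]) / (100.0 - random_baselines[t])
        for t in task_scores
    ]
    return sum(rescaled) / len(rescaled)

# Two hypothetical 4-way multiple-choice tasks (25% accuracy for random guessing):
print(npm({"task_a": 40.0, "task_b": 55.0}, {"task_a": 25.0, "task_b": 25.0}))  # -> 30.0
```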