Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

arXiv cs.CL / 3/27/2026


Key Points

  • The paper studies how synthetic rewriting interacts with the quality of the original source text during continued pretraining, shifting the focus from the English-only settings of most prior work to Portuguese.
  • From the quality-annotated ClassiCC-PT corpus, the authors construct two 10B-token subsets at different quality levels and rewrite each into four styles with a 7B instruction-tuned model, producing roughly 40B synthetic tokens per condition.
  • Evaluation on PoETa V2 (44 Portuguese tasks) shows a strong scale-dependent effect: for 7B base models, rewriting high-quality data improves results (+3.4 NPM), while rewriting low-quality data helps much less (+0.5 NPM); a sketch of the NPM computation follows this list.
  • For smaller 1.1B models, the quality–rewriting interaction is weaker, with unmodified low-quality data performing similarly to rewritten high-quality data.
  • Overall, the study concludes that synthetic rewriting functions mainly as a "quality multiplier" rather than a substitute for data curation, and that its benefits depend on model scale.
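
For readers unfamiliar with the metric, the sketch below shows one common way NPM (Normalized Preferred Metric) aggregates are computed: each task's preferred metric is rescaled so that random guessing maps to 0 and a perfect score maps to 100, then the rescaled scores are averaged. This is an assumption about the convention used here, not a definition taken from the paper, and the task names and baselines are illustrative.

```python
# Hypothetical NPM aggregate: rescale each task so random = 0 and
# perfect = 100, then average across tasks. All numbers are made up.

def npm(scores: dict[str, float],
        baselines: dict[str, float],
        max_score: float = 1.0) -> float:
    """Average normalized score across tasks (0 = random, 100 = perfect)."""
    normalized = [
        100.0 * (scores[t] - baselines[t]) / (max_score - baselines[t])
        for t in scores
    ]
    return sum(normalized) / len(normalized)

# Two hypothetical multiple-choice tasks with different chance levels.
scores = {"task_a": 0.62, "task_b": 0.55}     # model accuracies
baselines = {"task_a": 0.25, "task_b": 0.50}  # random-guess accuracies
print(f"NPM: {npm(scores, baselines):.1f}")   # -> NPM: 29.7
```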

Abstract

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
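
To make the rewriting setup concrete, here is a minimal sketch of the four-style rewriting loop, assuming a Hugging Face text-generation pipeline. The checkpoint, style names, and prompt wording are illustrative stand-ins for "a 7B instruction-tuned model" and the paper's four styles, which are not specified here.

```python
# Sketch of a document-rewriting pass; model, styles, and prompt are assumptions.
from transformers import pipeline

STYLES = ["textbook", "q_and_a", "summary", "wikipedia"]  # hypothetical style set

rewriter = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # stand-in for the paper's 7B instruct model
)

def rewrite(document: str, style: str) -> str:
    # Portuguese prompt, since the source corpus is Portuguese.
    prompt = (
        f"Reescreva o texto a seguir no estilo '{style}', "
        f"preservando todas as informações factuais.\n\n{document}\n\nReescrita:"
    )
    out = rewriter(prompt, max_new_tokens=1024, do_sample=True, temperature=0.7)
    # The pipeline echoes the prompt by default; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()

source_document = "Texto de exemplo do corpus."  # placeholder document

# Rewriting each 10B-token subset into all four styles is what yields
# the ~40B synthetic tokens per condition described above.
synthetic_versions = {style: rewrite(source_document, style) for style in STYLES}
```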