Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
arXiv cs.CL / 5/1/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- The paper studies a training-strategy trade-off for German (and other non-English languages): repeating highly filtered, high-quality data over multiple epochs versus training once on a larger, more diverse but lightly filtered corpus.
- Using hierarchical quality filters over 500M German web documents, the authors compare multi-epoch training on filtered subsets against single-pass training on diverse data across multiple model sizes and token budgets.
- Results show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets, and the advantage remains even after 7 epochs.
- The findings indicate that semantic concentration via quality filtering is a more sample-efficient route for non-English language modeling than maximizing unique data volume.
- The authors release their German models (“Boldt”) and cleaned evaluation benchmarks, reporting state-of-the-art performance while training on 10–360× fewer tokens than comparable models.
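The trade-off above can be illustrated with a toy sketch. Note this is purely illustrative: the paper's actual pipeline applies hierarchical quality filters at corpus scale, and the document structure, scores, and threshold below are invented for demonstration. The idea is that a fixed token budget can be spent either on one pass over everything or on several passes over only the high-quality subset.

```python
# Illustrative sketch only: field names, scores, and the threshold are
# hypothetical; the paper's real filtering is hierarchical and corpus-scale.

def filter_by_quality(docs, threshold):
    """Keep only documents whose (hypothetical) quality score clears the bar."""
    return [d for d in docs if d["score"] >= threshold]

def repeated_stream(docs, token_budget):
    """Cycle over the filtered subset until the token budget is spent."""
    stream, spent, i = [], 0, 0
    while spent < token_budget:
        doc = docs[i % len(docs)]
        stream.append(doc)
        spent += doc["tokens"]
        i += 1
    return stream

corpus = [
    {"id": 0, "tokens": 100, "score": 0.9},
    {"id": 1, "tokens": 100, "score": 0.2},
    {"id": 2, "tokens": 100, "score": 0.8},
    {"id": 3, "tokens": 100, "score": 0.1},
]

budget = 400  # the same budget as one single pass over the full corpus
filtered = filter_by_quality(corpus, threshold=0.5)
stream = repeated_stream(filtered, budget)

# With half the corpus filtered away, the budget buys two epochs
# over the high-quality subset instead of one pass over everything.
epochs = len(stream) / len(filtered)  # → 2.0
```

The paper's finding is that, across model sizes and budgets, the repeated high-quality stream trains better models than the single diverse pass, even at repetition counts as high as 7 epochs.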