Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

arXiv cs.CL / 5/1/2026


Key Points

  • The paper studies a training-strategy trade-off for German (with implications for other high-resource non-English languages): repeating highly filtered, high-quality data over multiple epochs versus training once on a larger, more diverse but lightly filtered corpus.
  • Using hierarchical quality filters over 500M German web documents, the authors compare multi-epoch training on filtered subsets against single-pass training on diverse data, across multiple model sizes and token budgets (see the budget sketch after this list).
  • Results show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets, and the advantage remains even after 7 epochs.
  • The findings indicate that semantic concentration via quality filtering is a more sample-efficient route for non-English language modeling than maximizing unique data volume.
  • The authors release their German models (“Boldt”) and cleaned evaluation benchmarks, reporting state-of-the-art performance while training on 10–360× fewer tokens than comparable models.
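To make the trade-off concrete, here is a minimal back-of-the-envelope sketch of the budget arithmetic behind the comparison. The corpus and budget sizes below are hypothetical placeholders, not figures from the paper.

```python
# Illustrative comparison of the two strategies under one fixed token
# budget. All sizes are hypothetical; the paper's corpora may differ.

def epochs_needed(token_budget: float, corpus_tokens: float) -> float:
    """How many passes over a corpus a given token budget buys."""
    return token_budget / corpus_tokens

# Hypothetical sizes: a strictly filtered high-quality core vs. the
# larger, lightly filtered web corpus it was drawn from.
FILTERED_CORE = 20e9    # 20B tokens after aggressive quality filtering
DIVERSE_CORPUS = 150e9  # 150B tokens with only light filtering

BUDGET = 140e9  # fixed training budget of 140B tokens

print(f"Quality strategy:   {epochs_needed(BUDGET, FILTERED_CORE):.1f} "
      f"epochs over the filtered core")
print(f"Diversity strategy: {epochs_needed(BUDGET, DIVERSE_CORPUS):.2f} "
      f"epochs (a single partial pass over the large corpus)")
```

The paper's finding is that, under such a fixed budget, the first strategy (many epochs over the filtered core) wins, and keeps winning even at around 7 passes over the same data.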

Abstract

Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10–360× fewer tokens than comparable models.
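As an illustration of what a hierarchical quality filter might look like, the sketch below cascades cheap heuristics into nested quality tiers, where stricter tiers are subsets of looser ones. The stage names, thresholds, and scores are assumptions for illustration; the paper's actual filter criteria are not specified in this summary.

```python
# A minimal sketch of a hierarchical (cascaded) quality filter over web
# documents. Thresholds and per-document scores are hypothetical, not
# the paper's actual criteria.

from dataclasses import dataclass

@dataclass
class Document:
    text: str
    lang_score: float     # e.g., confidence that the text is German
    quality_score: float  # e.g., output of a quality classifier

def tier(doc: Document) -> int:
    """Assign a document to nested quality tiers: 0 = discarded;
    each higher tier is a strict subset of the tier below it."""
    if doc.lang_score < 0.9 or len(doc.text) < 200:
        return 0  # fails basic language-ID / length heuristics
    if doc.quality_score < 0.5:
        return 1  # lightly filtered: the large, diverse corpus
    if doc.quality_score < 0.8:
        return 2  # medium filtering
    return 3      # strict filtering: high-quality core for repetition

docs = [
    Document("..." * 100, lang_score=0.99, quality_score=0.90),
    Document("..." * 100, lang_score=0.99, quality_score=0.60),
    Document("kurz", lang_score=0.95, quality_score=0.90),
]
for d in docs:
    print(tier(d))  # -> 3, 2, 0
```

Nesting the tiers this way lets one pipeline produce both the lightly filtered diverse corpus and the strictly filtered core that is repeated across epochs, which is exactly the pair of training sets the paper compares.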