Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
arXiv cs.CL / 4/1/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper proposes a German-language LLM data curation pipeline that blends heuristic filtering, model-based filtering, and synthetic data generation to improve training efficiency and downstream performance (see the sketch after this list).
- It presents Aleph-Alpha-GermanWeb, a 628B-word German pre-training dataset built from organic subsets of Common Crawl and FineWeb2, plus a synthetic subset generated with organic web data as conditioning context.
- The authors evaluate the dataset by training a 1B “Llama-style” model and an 8B tokeniser-free hierarchical autoregressive transformer (HAT) from scratch, then testing on German benchmarks including MMMLU.
- Results show that Aleph-Alpha-GermanWeb yields significant gains over FineWeb2 alone, and these gains persist even at the 8B scale when FineWeb2 is enhanced with human-curated sources like Wikipedia.
- The study concludes that model-based curation and synthetic data generation can materially improve German LLM pre-training datasets, supporting broader evidence from similar work in other languages/domains.
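To make the curation idea concrete, here is a minimal sketch of a two-stage pass that combines cheap heuristic rules with a model-based quality score. The `quality_score` callable, the specific heuristics, and the `0.5` threshold are illustrative assumptions, not the paper's actual filters or cut-offs.

```python
# Sketch of heuristic + model-based filtering; the scorer is a placeholder,
# not the classifier used in Aleph-Alpha-GermanWeb.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    url: str
    text: str


def passes_heuristics(doc: Document, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Cheap rule-based checks, e.g. minimum length and markup-symbol density."""
    words = doc.text.split()
    if len(words) < min_words:
        return False
    symbols = sum(ch in "#{}[]<>|" for ch in doc.text)
    return symbols / max(len(doc.text), 1) <= max_symbol_ratio


def curate(
    docs: Iterable[Document],
    quality_score: Callable[[str], float],  # hypothetical model-based scorer in [0, 1]
    threshold: float = 0.5,                 # illustrative cut-off, not the paper's value
) -> Iterator[Document]:
    """Yield documents that clear both the heuristic rules and the model-based score."""
    for doc in docs:
        if passes_heuristics(doc) and quality_score(doc.text) >= threshold:
            yield doc


if __name__ == "__main__":
    sample = [Document(url="https://example.org/de", text="Ein Beispieltext. " * 60)]
    # Stand-in scorer; a real pipeline would call a trained quality classifier here.
    kept = list(curate(sample, quality_score=lambda text: 0.9))
    print(f"kept {len(kept)} of {len(sample)} documents")
```

In the paper's setup, documents that survive filtering also serve as conditioning material for the synthetic subset; the sketch above covers only the filtering stage.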