GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training
arXiv cs.CL / 4/3/2026
Key Points
- The paper announces the GPT-NL Public Corpus, a large Dutch-first dataset of permissively licensed language resources intended for LLM pre-training.
- The dataset includes 21 Dutch-only collections totaling 36B preprocessed Dutch tokens, plus additional 207B English, 232B code, and 48B German/Danish tokens curated for compliance.
- Dutch data comes from two sources: curated versions of existing corpora (e.g., Common Crawl/Common Corpus) and newly created Dutch-specific collections. The latter may involve collaboration with organizations or synthetic augmentation.
- All included data is redistributed under a CC-BY license. Licensing, curation, and evaluation are aimed at enabling lawful, useful, and non-harmful commercial language model development.
- The full dataset is made publicly available via the Hugging Face Hub.
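
To put the token counts above in proportion, the short sketch below sums the four reported figures and computes each category's share. The numbers are taken from the summary as-is. Treating the four categories as disjoint and summing them into one total is an assumption made here for illustration, not something the summary states.

```python
# Token counts in billions, as reported in the key points above.
# Assumption: the four categories do not overlap, so they can be summed.
TOKENS_B = {
    "Dutch": 36,
    "English": 207,
    "Code": 232,
    "German/Danish": 48,
}

# Combined size of the corpus under the disjointness assumption.
total_b = sum(TOKENS_B.values())

# Percentage share of each category in the combined total.
shares = {name: 100 * count / total_b for name, count in TOKENS_B.items()}

print(f"Total: {total_b}B tokens")
for name, pct in shares.items():
    print(f"{name:>14}: {TOKENS_B[name]:>4}B ({pct:.1f}%)")
```

Under that assumption, the Dutch portion is a small fraction of the combined total, with English and code making up most of the tokens.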

