GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training

arXiv cs.CL / 4/3/2026


Key Points

  • The paper announces the GPT-NL Public Corpus, a large Dutch-first dataset of permissively licensed language resources intended for LLM pre-training.
  • The dataset includes 21 Dutch-only collections totaling 36B preprocessed Dutch tokens, plus an additional 207B English, 232B code, and 48B German/Danish tokens curated for compliance.
  • Dutch data is sourced from both curated versions of existing corpora (e.g., Common Crawl/Common Corpus) and newly created Dutch-specific collections, which may involve organizational collaboration or synthetic augmentation.
  • All included data is redistributed under a CC-BY license, with licensing, curation, and evaluation aimed at enabling lawful, useful, and non-harmful commercial language model development.
  • The full dataset is made publicly available via the Hugging Face Hub.

Abstract

We present the GPT-NL Public Corpus, the largest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B code, and 48B German/Danish tokens taken from existing sets which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus and Common Crawl, as well as newly created Dutch-specific collections. Most newly created Dutch collections consist of content collected in collaboration with organisations or synthetically augmented content. All data is collected and evaluated with the aim of facilitating the creation of (commercial) language models that are lawful, useful and non-harmful. All data included in the GPT-NL Public Corpus is sourced from datasets with permissive licensing and is curated and redistributed under a CC-BY license. The full dataset is publicly available on the Hugging Face Hub.
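Taken together, the per-language figures in the abstract imply a total of roughly 523B tokens. A minimal sketch of the breakdown, using only the counts stated in the announcement (the code itself is purely illustrative, not part of the release):

```python
# Token counts (in billions) as stated in the GPT-NL Public Corpus abstract.
token_counts_billions = {
    "Dutch": 36,
    "English": 207,
    "code": 232,
    "German/Danish": 48,
}

# Total size and per-language share of the corpus.
total = sum(token_counts_billions.values())
print(f"Total: ~{total}B tokens")
for name, count in sorted(token_counts_billions.items(), key=lambda kv: -kv[1]):
    print(f"  {name:<14} {count:>4}B  ({count / total:.1%})")
```

Note that the newly created Dutch material, while the headline contribution, is a minority of the raw token count; the bulk of the corpus by volume is the compliance-curated English and code data.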