GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training

arXiv cs.CL / 4/3/2026


Key Points

  • The paper announces the GPT-NL Public Corpus, a large Dutch-first dataset of permissively licensed language resources intended for LLM pre-training.
  • The dataset includes 21 Dutch-only collections totaling 36B preprocessed Dutch tokens, plus an additional 207B English, 232B code, and 48B German/Danish tokens curated for compliance.
  • Dutch data is sourced from both curated versions of existing corpora (e.g., Common Crawl/Common Corpus) and newly created Dutch-specific collections, which may involve organizational collaboration or synthetic augmentation.
  • All included data is redistributed under a CC-BY license, with licensing, curation, and evaluation aimed at enabling lawful, useful, and non-harmful commercial language model development.
  • The full dataset is made publicly available via the Hugging Face Hub.

Abstract

We present the GPT-NL Public Corpus, the largest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B code, and 48B German/Danish tokens taken from existing sets which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus and Common Crawl, as well as newly created Dutch-specific collections. Most newly created Dutch collections consist of content collected in collaboration with organisations or synthetically augmented content. All data is collected and evaluated with the aim of facilitating the creation of (commercial) language models that are lawful, useful and non-harmful. All data included in the GPT-NL Public Corpus is sourced from datasets with permissive licensing and is curated and redistributed under a CC-BY license. The full dataset is publicly available on the Hugging Face Hub.
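Taken together, the per-language figures in the abstract imply a total of roughly 523B tokens. A minimal sketch of the breakdown, using only the counts stated in the announcement (the code itself is purely illustrative, not part of the release):

```python
# Token counts (in billions) as stated in the GPT-NL Public Corpus abstract.
token_counts_billions = {
    "Dutch": 36,
    "English": 207,
    "code": 232,
    "German/Danish": 48,
}

# Total size and per-language share of the corpus.
total = sum(token_counts_billions.values())
print(f"Total: ~{total}B tokens")
for name, count in sorted(token_counts_billions.items(), key=lambda kv: -kv[1]):
    print(f"  {name:<14} {count:>4}B  ({count / total:.1%})")
```

Note that the newly created Dutch material, while the headline contribution, is a minority of the raw token count; the bulk of the corpus by volume is the compliance-curated English and code data.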