Language corpora for the Dutch medical domain

arXiv cs.CL / 4/29/2026

Key Points

  • The paper addresses a major gap in Dutch medical language resources, noting that limited corpora have constrained NLP development in the domain.
  • It builds a new Dutch medical corpus by translating English datasets, mining medical text from broader generic corpora, and collecting open Dutch medical resources.
  • The resulting corpus contains approximately 35 billion tokens across about 100 million documents and is released freely on Hugging Face.
  • The authors position the corpus as a foundational resource for both pre-training and downstream Dutch medical NLP tasks.
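
One of the construction steps above, mining medical text from generic corpora, can be illustrated with a simple lexicon-density filter. This is a minimal sketch only: the summary does not say how the authors identified medical text, so the seed vocabulary `MEDICAL_TERMS`, the helper names, and the 0.05 threshold are all hypothetical choices made for illustration.

```python
import re
import unicodedata

# Hypothetical seed vocabulary of Dutch medical terms (an assumption;
# the paper's actual selection criteria are not described in this summary).
MEDICAL_TERMS = {
    "patiënt", "diagnose", "behandeling", "medicatie",
    "symptomen", "huisarts", "ziekenhuis",
}

TOKEN_RE = re.compile(r"\w+")


def medical_term_density(text: str) -> float:
    """Return the fraction of tokens that appear in MEDICAL_TERMS."""
    # Normalize to NFC so accented letters such as "ë" are single code points.
    normalized = unicodedata.normalize("NFC", text)
    tokens = [t.lower() for t in TOKEN_RE.findall(normalized)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in MEDICAL_TERMS)
    return hits / len(tokens)


def looks_medical(text: str, threshold: float = 0.05) -> bool:
    """Flag a document as medical when its term density meets the threshold.

    The default threshold is an illustrative assumption, not a tuned value.
    """
    return medical_term_density(text) >= threshold


if __name__ == "__main__":
    docs = [
        "De diagnose en behandeling volgen later.",
        "Het weer is vandaag mooi en zonnig.",
    ]
    for doc in docs:
        print(looks_medical(doc), doc)
```

A filter like this is cheap enough to run over a very large generic corpus, but it is crude; in practice a trained classifier would likely give better precision and recall than a fixed word list.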

Abstract

Background: Dutch medical corpora are scarce, limiting NLP development.
Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.
Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face.
Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
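
As a rough sanity check on the reported scale, the two headline figures imply an average document length. The calculation below is only a back-of-the-envelope sketch: both inputs are the approximate values quoted in the abstract, so the result is an order-of-magnitude estimate, not a statistic reported by the authors.

```python
# Approximate figures quoted in the abstract.
TOTAL_TOKENS = 35_000_000_000      # about 35 billion tokens
TOTAL_DOCUMENTS = 100_000_000      # about 100 million documents

# Implied mean document length in tokens.
avg_tokens_per_document = TOTAL_TOKENS / TOTAL_DOCUMENTS

print(f"Roughly {avg_tokens_per_document:.0f} tokens per document on average")
```

An average of a few hundred tokens suggests that many documents are short, paragraph-sized texts, although the actual length distribution is not given in this summary.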