PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

arXiv cs.CL / 3/24/2026


Key Points

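  • PARHAF addresses the difficulty of sharing real clinical data under privacy constraints by providing an open corpus of French clinical documents, written by hand as entirely fictitious patient cases.
  • 104 medical residents across 18 specialties authored and peer-reviewed 7394 clinical reports (5009 patient cases), following a structured protocol based on epidemiological guidance from the SNDS (French National Health Data System) and predefined templates.
  • Alongside a general-purpose component designed to approximate real-world hospitalization distributions, the corpus includes four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding.
  • The corpus is released under a CC-BY license, with a portion temporarily embargoed for future benchmarking, so that training and evaluation can be carried out while preserving privacy.
  • PARHAF is useful for training and evaluating French clinical language models, and it also demonstrates a replicable methodology for building synthetic clinical corpora in other languages and health systems.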

Abstract

The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
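The abstract says the general-purpose component approximates real-world hospitalization distributions but does not describe how cases were allocated. The following is only a minimal sketch of one way such an allocation could work, assuming a largest-remainder split of a target case count across categories. The function name `allocate_quotas`, the category names, and the stay counts are all hypothetical and are not taken from the paper or from SNDS.

```python
from math import floor


def allocate_quotas(weights, total):
    """Split `total` cases across categories in proportion to `weights`.

    Uses the largest-remainder method so the quotas always sum to `total`.
    This is an illustrative assumption, not the protocol used for PARHAF.
    """
    weight_sum = sum(weights.values())
    # Ideal (fractional) number of cases per category.
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    # Start from the integer part of each ideal share.
    quotas = {name: floor(value) for name, value in exact.items()}
    leftover = total - sum(quotas.values())
    # Give the remaining cases to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - quotas[name], reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    return quotas


# Hypothetical hospital-stay counts per category (made up for illustration).
illustrative_stays = {
    "cardiology": 30,
    "oncology": 25,
    "infectious_diseases": 20,
    "surgery": 25,
}

quotas = allocate_quotas(illustrative_stays, 1000)
for name, count in quotas.items():
    print(f"{name}: {count}")
```

Because the leftover cases are distributed by fractional remainder, the quotas always add up to the requested total, even when the proportions do not divide evenly.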