Generating High Quality Synthetic Data for Dutch Medical Conversations

arXiv cs.CL / 4/14/2026

Key Points

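  • To address the shortage of Dutch medical dialogue data, which is hard to obtain because of privacy and ethical constraints, the study proposes a pipeline that generates synthetic Dutch medical dialogues with an LLM, using real conversations as a reference.
  • The generated dialogues were evaluated with quantitative metrics (such as lexical diversity) and with qualitative review by native speakers and medical practitioners. Lexical diversity was high, but turn-taking was overly regular, so the conversations tended to read as scripted.
  • Qualitative review yielded slightly below-average scores, with raters pointing to problems in domain specificity and natural expression.
  • Because the correlation between numerical metrics and human ratings was limited, the authors conclude that numerical metrics alone cannot adequately capture the linguistic quality of a conversation.
  • Synthetic dialogue generation is feasible, but the authors state that domain knowledge and careful prompt design are important for balancing naturalness and conversational structure. The work provides a foundation for expanding Dutch clinical NLP resources.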

Abstract

Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores, with raters noting issues in domain specificity and natural expression. The limited correlation between quantitative and qualitative results highlights that numerical metrics alone cannot fully capture linguistic quality. Our findings demonstrate that generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure in conversation. This work provides a foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.
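The abstract does not say which quantitative metrics were computed, so the following is only a minimal sketch of how lexical variety and turn-taking regularity could be measured on a dialogue. The metric choices (type-token ratio, coefficient of variation of turn length, speaker alternation rate), the function names, and the short Dutch example dialogue are all illustrative assumptions, not the authors' method or data.

```python
import re
from statistics import mean, pstdev

# A dialogue is assumed to be a list of (speaker, utterance) pairs.
# This tiny example is invented for illustration only.
dialogue = [
    ("arts", "Goedemorgen, wat kan ik voor u doen?"),
    ("patient", "Ik heb al een week last van hoofdpijn."),
    ("arts", "Heeft u ook koorts?"),
    ("patient", "Nee, geen koorts."),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(turns):
    """Lexical variety: unique tokens divided by total tokens."""
    tokens = [tok for _, utt in turns for tok in tokenize(utt)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def turn_length_cv(turns):
    """Coefficient of variation of turn length in tokens.

    Values near 0 mean every turn has about the same length,
    which is one possible sign of scripted dialogue.
    """
    lengths = [len(tokenize(utt)) for _, utt in turns]
    if not lengths or mean(lengths) == 0:
        return 0.0
    return pstdev(lengths) / mean(lengths)


def alternation_rate(turns):
    """Fraction of consecutive turn pairs where the speaker changes.

    A value of 1.0 means strict A-B-A-B alternation with no
    consecutive turns by the same speaker.
    """
    speakers = [spk for spk, _ in turns]
    if len(speakers) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(speakers, speakers[1:]))
    return switches / (len(speakers) - 1)


if __name__ == "__main__":
    print("type-token ratio:", round(type_token_ratio(dialogue), 3))
    print("turn length CV:", round(turn_length_cv(dialogue), 3))
    print("alternation rate:", alternation_rate(dialogue))
```

Under these assumptions, a synthetic corpus with a high type-token ratio but an alternation rate close to 1.0 and a low turn-length variation would match the pattern the abstract describes: varied vocabulary with overly regular turn-taking. Note that type-token ratio depends on text length, so comparisons are only meaningful between dialogues of similar size.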