Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

arXiv cs.LG / 5/1/2026


Key Points

  • The paper addresses the shortage of high-quality annotated clinical and especially mental-health data by proposing LLM-driven synthetic data augmentation under privacy constraints.
  • It uses three LLMs (DeepSeek-R1, OpenBioLLM-Llama3, and Qwen 3.5) to generate ICD-10–conditioned synthetic mental-health evaluation reports.
  • To avoid common risks like mode collapse and privacy leaks/memorization, the study introduces a multi-dimensional evaluation framework.
  • Generated outputs are scored on semantic fidelity (clinically consistent meaning), lexical diversity (varied language), and privacy/plagiarism (reduced memorization and copying).
  • Results indicate the models produce clinically coherent, diverse, and privacy-safe reports, enabling larger training datasets for clinical NLP without breaching confidentiality.
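The three evaluation dimensions above can be illustrated with simple stdlib-only proxies. This is a hedged sketch, not the paper's actual metrics: a real pipeline would likely use embedding cosine similarity for semantic fidelity, while here Jaccard token overlap, distinct-n, and verbatim n-gram matching stand in as minimal, assumed substitutes.

```python
def tokens(text):
    """Whitespace tokenization; real pipelines would use a proper tokenizer."""
    return text.lower().split()

def semantic_fidelity(synthetic, reference):
    """Crude proxy for semantic fidelity: Jaccard overlap of token sets.
    (An embedding-based cosine similarity is the more usual choice.)"""
    a, b = set(tokens(synthetic)), set(tokens(reference))
    return len(a & b) / len(a | b) if a | b else 0.0

def lexical_diversity(text, n=2):
    """Distinct-n: fraction of unique n-grams among all n-grams.
    Low values suggest repetitive output (a symptom of mode collapse)."""
    toks = tokens(text)
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def max_ngram_overlap(synthetic, corpus_texts, n=5):
    """Privacy/plagiarism proxy: True if any length-n token span of the
    synthetic text appears verbatim in the source corpus (memorization)."""
    toks = tokens(synthetic)
    spans = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    for doc in corpus_texts:
        dtoks = tokens(doc)
        if spans & {tuple(dtoks[i:i + n]) for i in range(len(dtoks) - n + 1)}:
            return True
    return False
```

A generated report would be kept only if fidelity is high, diversity is above a threshold, and no long verbatim span matches the private training corpus.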

Abstract

The scarcity of high-quality annotated medical data, particularly in mental health, poses a significant bottleneck for training robust machine learning models. Privacy regulations restrict data sharing, making synthetic data generation a promising alternative, and Large Language Models (LLMs) can serve as the generation engine in such a data augmentation pipeline. In the proposed methodology, DeepSeek-R1, OpenBioLLM-Llama3, and Qwen 3.5 generate synthetic mental-health evaluation reports conditioned on specific International Classification of Diseases, Tenth Revision (ICD-10) codes. Because naive text generation can lead to mode collapse or privacy breaches (memorization), a comprehensive evaluation framework is introduced. The generated diagnostic texts are assessed across three dimensions: semantic fidelity, lexical diversity, and privacy/plagiarism. The results demonstrate that all three models can generate clinically coherent, diverse, and privacy-safe synthetic reports, significantly expanding the available training data for clinical natural language processing tasks without compromising patient confidentiality.
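The ICD-10-conditioned generation step can be sketched as a prompt-construction function. Everything here is illustrative: the template wording, the small code-to-description mapping, and the function name are assumptions, not the paper's actual prompts.

```python
# Hypothetical ICD-10 -> description mapping; a real system would draw on the
# full ICD-10 classification rather than this two-entry illustration.
ICD10_DESCRIPTIONS = {
    "F32.1": "Major depressive disorder, single episode, moderate",
    "F41.1": "Generalized anxiety disorder",
}

def build_prompt(icd10_code: str) -> str:
    """Build a generation prompt conditioned on a specific ICD-10 code,
    mirroring the paper's idea of diagnosis-conditioned synthesis."""
    desc = ICD10_DESCRIPTIONS[icd10_code]
    return (
        "You are a clinician writing a de-identified mental-health "
        "evaluation report.\n"
        f"Diagnosis (ICD-10 {icd10_code}): {desc}.\n"
        "Write a synthetic evaluation report consistent with this diagnosis. "
        "Do not include any real patient identifiers."
    )
```

The same prompt would then be sent to each of the three LLMs, and the resulting reports scored along the fidelity, diversity, and privacy dimensions before being admitted to the augmented training set.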