Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

arXiv cs.LG / 3/27/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper proposes a zero-shot, knowledge-guided Retrieval-Augmented Generation (RAG) framework to generate synthetic psychiatric tabular data when real patient datasets are unavailable.
It steers LLMs using DSM-5 and ICD-10 as knowledge bases via RAG, then benchmarks the generated data against CTGAN and TVAE, which require real data and may introduce privacy risk.
Experiments on six anxiety-related disorders find that CTGAN usually performs best on marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and achieves the lowest pairwise error for separation anxiety and social anxiety.
An ablation study suggests clinical retrieval improves univariate and pairwise fidelity versus an LLM without retrieval, and privacy analyses indicate the no-real-data LLM has modest overlaps and low average linkage risk, with TVAE showing more duplication despite some metrics.
Overall, the authors conclude that grounding LLMs in clinical taxonomies can produce high-quality, privacy-preserving synthetic psychiatric datasets suitable for healthcare research workflows that can’t share real data.

Abstract

AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.

Got My 39-Agent System Audited Live. Here's What the Maturity Scorecard Revealed.

Dev.to

The Redline Economy

Dev.to

$500 GPU outperforms Claude Sonnet on coding benchmarks

Dev.to

From Scattershot to Sniper: AI for Hyper-Personalized Media Lists

Dev.to

The LiteLLM Supply Chain Attack: A Wake-Up Call for AI Infrastructure

Dev.to

Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

Key Points

Abstract

Related Articles

Got My 39-Agent System Audited Live. Here's What the Maturity Scorecard Revealed.

The Redline Economy

$500 GPU outperforms Claude Sonnet on coding benchmarks

From Scattershot to Sniper: AI for Hyper-Personalized Media Lists

The LiteLLM Supply Chain Attack: A Wake-Up Call for AI Infrastructure

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer