Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

arXiv cs.CL / 3/27/2026


Key Points

  • The paper introduces a Retrieval-Reasoning framework that uses few-shot prompting with an LLM to generate synthetic clinical trial reports with binary success/failure outcomes.
  • It combines a retrieval module to ground generation in relevant ClinicalTrials.gov data and a reasoning module to produce domain-consistent justifications.
  • Experiments on real trials from ClinicalTrials.gov show that the synthetic trials can augment real datasets effectively.
  • The authors fine-tune a BioBERT classifier using synthetic data, real data, or mixtures, finding that hybrid fine-tuning improves clinical trial outcome prediction performance.
  • The work argues that LLM-generated synthetic trials could support privacy-preserving data augmentation for clinical research, and releases accompanying code on GitHub.
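
To make the retrieval and few-shot prompting steps in the points above concrete, here is a minimal sketch in Python. It is not the authors' implementation. The token-overlap (Jaccard) retriever stands in for whatever retrieval method the paper uses, and the function names, prompt wording, and toy trial records are all hypothetical. The call to the LLM itself is omitted.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, trials, k=2):
    """Return the k trials whose summaries overlap most with the query.

    Jaccard similarity is an assumption chosen for simplicity; the paper's
    retrieval module may work differently.
    """
    q = tokenize(query)

    def score(trial):
        t = tokenize(trial["summary"])
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(trials, key=score, reverse=True)[:k]

def build_prompt(query, examples):
    """Assemble a few-shot prompt from retrieved trials (hypothetical wording)."""
    parts = [
        "Write a synthetic clinical trial report, a short rationale, "
        "and an outcome (success or failure)."
    ]
    for ex in examples:
        parts.append(f"Trial: {ex['summary']}\nOutcome: {ex['outcome']}")
    parts.append(f"Topic: {query}\nTrial:")
    return "\n\n".join(parts)

# Hypothetical toy records, not real ClinicalTrials.gov entries.
TRIALS = [
    {"summary": "Phase 2 trial of metformin in adults with type 2 diabetes",
     "outcome": "success"},
    {"summary": "Phase 3 trial of a statin for cardiovascular risk reduction",
     "outcome": "success"},
    {"summary": "Phase 2 trial of an insulin analog in type 1 diabetes",
     "outcome": "failure"},
]

query = "phase 2 diabetes trial"
examples = retrieve(query, TRIALS, k=2)
prompt = build_prompt(query, examples)
print(prompt)
```

The retrieved examples carry their outcome labels into the prompt, which is one simple way to ground generation in existing trials while asking the model for a labeled report and a justification.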

Abstract

Machine learning (ML) holds great promise for clinical applications but is often hindered by limited access to high-quality data due to privacy concerns, high costs, and long timelines associated with clinical trials. While large language models (LLMs) have demonstrated strong performance in general-purpose generation tasks, their application to synthesizing realistic clinical trials remains underexplored. In this work, we propose a novel Retrieval-Reasoning framework that leverages few-shot prompting with LLMs to generate synthetic clinical trial reports annotated with binary success/failure outcomes. Our approach integrates a retrieval module to ground the generation on relevant trial data and a reasoning module to ensure domain-consistent justifications. Experiments conducted on real clinical trials from the ClinicalTrials.gov database demonstrate that the generated synthetic trials effectively augment real datasets. Fine-tuning a BioBERT classifier on synthetic data, real data, or their combination shows that hybrid fine-tuning leads to improved performance on clinical trial outcome prediction tasks. Our results suggest that LLM-based synthetic data can serve as a powerful tool for privacy-preserving data augmentation in clinical research. The code is available at https://github.com/XuZR3x/Retrieval_Reasoning_Clinical_Trial_Generation.
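
The hybrid fine-tuning setup described in the abstract can be illustrated by how a mixed training set might be assembled before it is passed to a classifier. The sketch below is an assumption-laden illustration, not the authors' code: the `mix_datasets` helper, the `synthetic_ratio` parameter, and the toy records are hypothetical, and the abstract does not state the mixing proportions used in the paper.

```python
import random

def mix_datasets(real, synthetic, synthetic_ratio=0.5, seed=0):
    """Combine all real examples with a random sample of synthetic ones.

    The number of synthetic examples is synthetic_ratio times the number of
    real examples, capped at the size of the synthetic pool. The ratio is a
    hypothetical knob, not a value taken from the paper.
    """
    rng = random.Random(seed)
    n_syn = min(len(synthetic), int(round(synthetic_ratio * len(real))))
    mixed = list(real) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed

# Hypothetical toy records: text, binary outcome label, and data source.
REAL = [
    {"text": f"real trial {i}", "label": i % 2, "source": "real"}
    for i in range(4)
]
SYNTHETIC = [
    {"text": f"synthetic trial {i}", "label": i % 2, "source": "synthetic"}
    for i in range(6)
]

mixed = mix_datasets(REAL, SYNTHETIC, synthetic_ratio=0.5, seed=0)
print(len(mixed), sum(r["source"] == "synthetic" for r in mixed))
```

Setting `synthetic_ratio` to 0 gives the real-only condition, while passing an empty `real` list and training on `SYNTHETIC` directly would correspond to the synthetic-only condition.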