LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

arXiv cs.CL / 4/10/2026


Key Points

  • The paper proposes an LLM-driven pipeline to generate synthetic French OSCE doctor–patient dialogue transcripts in a low-resource setting where real annotated data is scarce.
  • It also uses LLM-assisted “silver labeling” to automatically evaluate the dialogues against scenario-specific clinical skills criteria, including a mix of ideal and perturbed performances to emulate different student proficiency levels.
  • Benchmarking across several open-source and proprietary LLMs finds that mid-size models (≤32B parameters) can reach accuracy on synthetic data comparable to GPT-4o (~90%), suggesting strong practical feasibility.
  • The authors argue the approach could enable privacy-preserving, locally deployable training-time evaluation systems that reduce reliance on human examiners for repeated practice and feedback in French medical education.
  • The work is positioned as a controlled way to create reproducible benchmarks for French OSCE assessment research despite the lack of real annotated transcripts.

Abstract

Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor–patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
