EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

arXiv cs.CL · April 17, 2026


Key Points

  • The EuropeMedQA study protocol introduces a new multilingual, multimodal medical examination dataset built from official regulatory exams across Italy, France, Spain, and Portugal.
  • It targets a key gap in current LLM medical evaluations: performance drops in non-English languages and the lack of multimodal diagnostic/visual reasoning tasks.
  • The protocol specifies a rigorous data curation process aligned with FAIR data principles and SPIRIT-AI guidelines, along with an automated translation pipeline for cross-language comparison.
  • It plans to evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting approach to measure cross-lingual transfer and visual reasoning.
  • The benchmark is designed to be contamination-resistant and more representative of European clinical practices to support development of more generalizable medical AI.

Abstract

While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.