MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

arXiv cs.CL / 3/18/2026

📰 NewsIdeas & Deep AnalysisTools & Practical UsageModels & Research

共有:

Key Points

MedArena is an interactive platform that lets clinicians compare leading LLMs on their own real-world medical queries, addressing shortcomings of static benchmarks.
Across 1571 clinician preferences from 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o emerged as the top models by Bradley-Terry rating.
Most clinician prompts involved treatment decisions, clinical documentation, or patient communication rather than factual recall, with ~20% involving multi-turn conversations.
The study finds model rankings remain stable after adjusting for style factors like response length and formatting, supporting MedArena as a scalable, real-world evaluation approach for medical LLMs.

Abstract

Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.

The Security Gap in MCP Tool Servers (And What I Built to Fix It)

Dev.to

I made a new programming language to get better coding with less tokens.

Dev.to

RSA Conference 2026: The Week Vibe Coding Security Became Impossible to Ignore

Dev.to

Adversarial AI framework reveals mechanisms behind impaired consciousness and a potential therapy

Reddit r/artificial

Why I Switched From GPT-4 to Small Language Models for Two of My Products

Dev.to

MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

Key Points

Abstract

Related Articles

The Security Gap in MCP Tool Servers (And What I Built to Fix It)

I made a new programming language to get better coding with less tokens.

RSA Conference 2026: The Week Vibe Coding Security Became Impossible to Ignore

Adversarial AI framework reveals mechanisms behind impaired consciousness and a potential therapy

Why I Switched From GPT-4 to Small Language Models for Two of My Products

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer