MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

arXiv cs.CL / 3/26/2026


Key Points

  • The paper introduces MedMT-Bench, a new medical multi-turn instruction-following benchmark designed to stress-test long-context memory, robustness to interference, and safety-critical behavior during simulated diagnosis and treatment conversations.
  • MedMT-Bench contains 400 test cases with an average of 22 rounds (up to 52), generated through scene-by-scene data synthesis and refined with manual expert editing to match real-world medical workflows.
  • The evaluation uses an LLM-as-judge protocol with instance-level rubrics and atomic scoring points, validated against expert annotations with a reported 91.94% human–LLM agreement.
  • When tested on 17 frontier models, all systems underperform, with overall accuracy below 60% and the best result at 59.75%, indicating current models still struggle with long multi-turn medical reasoning and instruction adherence.
  • The authors position MedMT-Bench as a targeted tool to guide future research toward safer and more reliable medical AI systems.
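The scoring described above (atomic points per case, micro-averaged into overall accuracy, and human–LLM agreement over judge verdicts) can be sketched as follows. This is a minimal illustration assuming pass/fail verdicts per atomic point; the function names and toy data are hypothetical, not from the paper.

```python
# Hypothetical sketch of atomic-point scoring: each test case is judged on a
# set of atomic scoring points, and metrics aggregate the pass/fail verdicts.
# Names and data are illustrative, not taken from MedMT-Bench itself.

def case_accuracy(verdicts):
    """Fraction of atomic scoring points the model satisfied in one case."""
    return sum(verdicts) / len(verdicts)

def overall_accuracy(cases):
    """Micro-average over all atomic points across all test cases."""
    total = sum(len(v) for v in cases)
    passed = sum(sum(v) for v in cases)
    return passed / total

def human_llm_agreement(human, llm):
    """Fraction of atomic points where the LLM judge matches the expert."""
    assert len(human) == len(llm)
    return sum(h == l for h, l in zip(human, llm)) / len(human)

# Toy example (illustrative only): two cases with 3 and 2 atomic points.
cases = [[True, True, False], [True, False]]
print(round(overall_accuracy(cases), 2))  # 3 of 5 points passed -> 0.6
```

Under this scheme, the paper's reported 91.94% agreement would correspond to the LLM judge matching expert verdicts on roughly 92 of every 100 atomic points.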

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defenses required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52), covering 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human–LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research toward safer and more reliable medical AI. The benchmark is available at https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material