MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
arXiv cs.CL / 3/26/2026
Key Points
- The paper introduces MedMT-Bench, a new medical multi-turn instruction-following benchmark designed to stress-test long-context memory, robustness to interference, and safety-critical behavior during simulated diagnosis and treatment conversations.
- MedMT-Bench contains 400 test cases with an average of 22 rounds (up to 52), generated through scene-by-scene data synthesis and refined with manual expert editing to match real-world medical workflows.
- The evaluation uses an LLM-as-judge protocol with instance-level rubrics and atomic scoring points, validated against expert annotations with a reported 91.94% human–LLM agreement.
- Across 17 frontier models tested, no system exceeds 60% overall accuracy, with the best result at 59.75%, indicating that current models still struggle with long multi-turn medical reasoning and instruction adherence.
- The authors position MedMT-Bench as a targeted tool to guide future research toward safer and more reliable medical AI systems.
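The rubric-based protocol described above, per-instance rubrics decomposed into atomic scoring points, a judge verdict per point, and a human-LLM agreement check, can be sketched as follows. All names, data structures, and example rubric points here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of atomic-point scoring and judge-expert agreement.
# ScoringPoint, score_case, and the example rubric are hypothetical.
from dataclasses import dataclass

@dataclass
class ScoringPoint:
    description: str   # one atomic criterion from the instance-level rubric
    satisfied: bool    # LLM-judge verdict for this point

def score_case(points: list[ScoringPoint]) -> float:
    """Fraction of atomic scoring points the model response satisfies."""
    if not points:
        return 0.0
    return sum(p.satisfied for p in points) / len(points)

def agreement(judge: list[bool], expert: list[bool]) -> float:
    """Per-point human-LLM agreement rate (the paper reports 91.94%)."""
    assert len(judge) == len(expert) and judge
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)

# Example: 3 of 4 hypothetical rubric points satisfied -> case score 0.75
points = [
    ScoringPoint("states correct dosage limit", True),
    ScoringPoint("recalls allergy mentioned in an earlier turn", True),
    ScoringPoint("flags the drug interaction", False),
    ScoringPoint("follows the requested output format", True),
]
print(score_case(points))  # 0.75
```

Averaging case scores over all 400 test cases would then yield an overall accuracy comparable to the sub-60% figures reported, though the paper's exact aggregation may differ.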