ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
arXiv cs.CL · March 13, 2026
Key Points
- ThReadMed-QA introduces a benchmark of 2,437 fully-answered patient-physician conversations from r/AskDocs, totaling 8,204 QA pairs across up to 9 turns.
- The benchmark uses a physician-grounded, calibrated rubric to evaluate five state-of-the-art LLMs (GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B) on a stratified test subset.
- Results show that even GPT-5 produces fully-correct responses only 41.2% of the time; all models' accuracy deteriorates from turn 0 to turn 2, and wrong-answer rates roughly triple by turn 3.
- The paper introduces multi-turn failure metrics—Conversational Consistency Score (CCS) and Error Propagation Rate (EPR)—and reveals that stronger initial performers are more prone to steep declines and error propagation in longer dialogues.
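To make the multi-turn failure metrics concrete, here is a minimal sketch of how a turn-level accuracy and an error-propagation rate could be computed over per-dialogue correctness labels. The paper's exact definitions of CCS and EPR are not reproduced in this summary, so the formulas below (and the toy data) are illustrative stand-ins, not the authors' implementation.

```python
# Illustrative sketch only: the paper's precise CCS/EPR formulas are not
# given here, so these are plausible stand-in definitions, not the authors'.
from typing import List


def turn_accuracy(dialogues: List[List[bool]], turn: int) -> float:
    """Fraction of dialogues answered correctly at a given turn,
    among dialogues that actually reach that turn."""
    reached = [d for d in dialogues if len(d) > turn]
    return sum(d[turn] for d in reached) / len(reached)


def error_propagation_rate(dialogues: List[List[bool]]) -> float:
    """Toy EPR: among dialogues containing at least one error, the
    fraction where some turn after the first error is also wrong."""
    with_error = [d for d in dialogues if not all(d)]

    def propagates(d: List[bool]) -> bool:
        first = d.index(False)
        return any(not ok for ok in d[first + 1:])

    return sum(propagates(d) for d in with_error) / len(with_error)


# Each inner list: per-turn correctness for one dialogue (toy data).
dialogues = [
    [True, True, False, False],   # error at turn 2 persists
    [True, False, True, True],    # error at turn 1 is recovered
    [True, True, True, True],     # fully correct throughout
]
print(turn_accuracy(dialogues, 0))        # 1.0
print(error_propagation_rate(dialogues))  # 0.5
```

Under these stand-in definitions, a model that answers early turns well but compounds its first mistake would show high turn-0 accuracy alongside a high EPR, matching the pattern the paper reports for stronger initial performers.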
