ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
arXiv cs.CL / 3/13/2026
Key Points
- ThReadMed-QA introduces a benchmark of 2,437 fully-answered patient-physician conversations from r/AskDocs, totaling 8,204 QA pairs across up to 9 turns.
- The benchmark uses a physician-grounded, calibrated rubric to evaluate five state-of-the-art LLMs (GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B) on a stratified test subset.
- Results show that even GPT-5 achieves only 41.2% fully correct responses, with every model's accuracy deteriorating from turn 0 to turn 2 and wrong-answer rates roughly tripling by turn 3.
- The paper introduces multi-turn failure metrics—Conversational Consistency Score (CCS) and Error Propagation Rate (EPR)—and reveals that stronger initial performers are more prone to steep declines and error propagation in longer dialogues.
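To make the multi-turn failure metrics concrete, here is a minimal sketch of how scores like CCS and EPR could be computed from per-turn correctness labels. The definitions below are illustrative assumptions, not the paper's exact formulas: CCS is taken as the fraction of adjacent turn pairs in which a correct answer stays correct, and EPR as the fraction of turns after the first error that are also wrong.

```python
# Hedged sketch: a dialogue is a list of booleans, one per turn
# (True = fully correct response). Metric definitions are assumptions.

def conversational_consistency_score(turns):
    """Fraction of adjacent turn pairs (t, t+1) where a correct
    answer at turn t remains correct at turn t+1 (assumed CCS)."""
    pairs_after_correct = [b for a, b in zip(turns, turns[1:]) if a]
    if not pairs_after_correct:
        return 0.0
    return sum(pairs_after_correct) / len(pairs_after_correct)

def error_propagation_rate(turns):
    """Fraction of turns after the first error that are also
    wrong, i.e. how much an early error 'sticks' (assumed EPR)."""
    try:
        first_err = turns.index(False)
    except ValueError:
        return 0.0  # no error ever occurs
    later = turns[first_err + 1:]
    if not later:
        return 0.0  # error happened on the final turn
    return sum(1 for t in later if not t) / len(later)

dialogue = [True, True, False, False, True]
print(conversational_consistency_score(dialogue))  # 0.5
print(error_propagation_rate(dialogue))            # 0.5
```

Under these assumed definitions, a model whose turn-0 answer is strong but which degrades over the conversation would show a high turn-0 accuracy alongside a low CCS and a high EPR, matching the paper's finding that strong initial performers can still decline steeply in longer dialogues.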