Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

arXiv cs.CL / 4/7/2026

Key Points

The paper introduces MINT (Medical Incremental N-Turn Benchmark), a multi-turn medical diagnosis benchmark with 1,035 cases and labeled evidence “shards” designed to preserve clinically meaningful information across turns.
Evaluations of 11 LLMs on MINT find three recurring behaviors: models often answer before enough evidence has been observed, they revise incorrect answers to correct ones far more often than the reverse, and salient evidence (e.g., lab results) strongly “lures” them into premature commitments.
The study shows that deferring the diagnostic question to later turns can reduce premature answering and improve first-commit accuracy by up to 62.6%.
It also finds that holding salient clinical evidence for later turns can prevent the large accuracy drop, up to 23.3%, associated with premature commitment.
The authors provide both an evaluation framework for realistic multi-turn clinical reasoning and concrete interaction recommendations to improve LLM reliability in diagnostic workflows.
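The two interaction recommendations above (ask the diagnostic question late, and hold salient evidence until late) can be illustrated with a small sketch. This is not the authors' code: the `Shard` class, its `salient` flag, and the `schedule_turns` helper are hypothetical names introduced here, and the sketch assumes one shard per turn.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Shard:
    """One piece of clinical evidence (hypothetical structure, not from the paper)."""
    text: str
    salient: bool = False  # assumption: True for "lure"-type evidence such as lab results


def schedule_turns(
    shards: Sequence[Shard],
    question: str,
    hold_salient: bool = True,
    defer_question: bool = True,
) -> List[str]:
    """Build one user message per shard.

    hold_salient: move salient shards to the end, keeping relative order otherwise.
    defer_question: attach the question to the last turn instead of the first.
    """
    ordered = list(shards)
    if hold_salient:
        # Stable partition: non-salient evidence first, salient evidence last.
        ordered = [s for s in shards if not s.salient] + [s for s in shards if s.salient]

    turns = [s.text for s in ordered]
    if not turns:
        return [question]

    if defer_question:
        turns[-1] = f"{turns[-1]}\n\n{question}"
    else:
        turns[0] = f"{question}\n\n{turns[0]}"
    return turns


if __name__ == "__main__":
    case = [
        Shard("54-year-old with chest pain for two hours."),
        Shard("Troponin is elevated.", salient=True),
        Shard("Long-term smoker."),
    ]
    for i, turn in enumerate(schedule_turns(case, "What is the most likely diagnosis?"), 1):
        print(f"Turn {i}: {turn!r}")
```

Setting both flags to `False` reproduces the contrasting setup, with the question asked up front and shards kept in their original order, so the two schedules can be compared on the same case.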

Abstract

Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, which is closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures: clinically salient information such as laboratory results triggers premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
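To make the three quantities in the abstract concrete, here is a minimal sketch of how an early-commitment share, first-commit accuracy, and the two flip counts could be computed from per-turn answer traces. The trace format (one answer or `None` per turn), the function names, and the choice of denominator (cases that commit at least once) are assumptions made for illustration; the paper's exact definitions may differ.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Trace = Tuple[Sequence[Optional[str]], str]  # (answer per turn or None, gold diagnosis)


def first_commit_turn(answers: Sequence[Optional[str]]) -> Optional[int]:
    """Return the 1-based turn of the first committed answer, or None if none."""
    for turn, answer in enumerate(answers, start=1):
        if answer is not None:
            return turn
    return None


def count_flips(answers: Sequence[Optional[str]], gold: str) -> Tuple[int, int]:
    """Count (incorrect_to_correct, correct_to_incorrect) revisions.

    Only consecutive committed answers are compared; turns without an answer
    are skipped, and incorrect-to-incorrect changes are not counted.
    """
    committed = [a for a in answers if a is not None]
    i2c = c2i = 0
    for prev, curr in zip(committed, committed[1:]):
        if prev != gold and curr == gold:
            i2c += 1
        elif prev == gold and curr != gold:
            c2i += 1
    return i2c, c2i


def summarize(traces: List[Trace], early_turns: int = 2) -> Dict[str, Optional[float]]:
    """Aggregate the illustrative metrics over all traces."""
    first_turns: List[int] = []
    first_correct = 0
    i2c_total = c2i_total = 0
    for answers, gold in traces:
        turn = first_commit_turn(answers)
        if turn is None:
            continue  # the model never committed on this case
        first_turns.append(turn)
        first_correct += answers[turn - 1] == gold
        i2c, c2i = count_flips(answers, gold)
        i2c_total += i2c
        c2i_total += c2i
    n = len(first_turns)
    return {
        "committed_cases": n,
        "early_commit_share": sum(t <= early_turns for t in first_turns) / n if n else None,
        "first_commit_accuracy": first_correct / n if n else None,
        "incorrect_to_correct": i2c_total,
        "correct_to_incorrect": c2i_total,
    }


if __name__ == "__main__":
    # Made-up traces for illustration only; not data from the paper.
    sample = [
        ([None, "flu", "pneumonia"], "pneumonia"),
        (["asthma", "asthma", "copd"], "asthma"),
        ([None, None, "anemia"], "anemia"),
        ([None, None, None], "gout"),
    ]
    print(summarize(sample))
```

The ratio of `incorrect_to_correct` to `correct_to_incorrect` is left to the caller, since the second count can be zero on small samples.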
