This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

arXiv cs.CL / 4/8/2026


Key Points

  • The paper studies how patient question phrasing affects large language model (LLM) responses in medical QA using a controlled retrieval-augmented generation (RAG) setup with expert-selected documents.
  • Using a dataset of 6,614 query pairs based on clinical trial abstracts, the authors compare effects of question framing (positive vs. negative) and language style (technical vs. plain).
  • Results show that positive/negative framing pairs are significantly more likely to yield contradictory conclusions than same-framing pairs, indicating sensitivity to framing even with identical underlying evidence.
  • The inconsistency is amplified in multi-turn conversations, where continued interaction increases persuasion-driven divergence.
  • The study finds no significant interaction between framing and language style, and concludes that phrasing robustness should be a key evaluation criterion for high-stakes RAG medical systems.
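The core evaluation idea behind these findings can be sketched in a few lines: pose the same underlying claim with positive and negative framing, label each model response with a coarse conclusion, and flag the pair when the conclusions conflict. The sketch below is purely illustrative; the query templates, keyword lists, and function names are assumptions, not the paper's actual implementation (which evaluates free-form LLM outputs over clinical trial abstracts).

```python
# Hypothetical sketch of a framing-pair consistency check.
# Templates and keywords are illustrative assumptions, not the paper's method.

def frame_queries(intervention: str, condition: str) -> dict:
    """Build a positively- and a negatively-framed query about the same claim."""
    return {
        "positive": f"Does {intervention} work for {condition}?",
        "negative": f"{intervention} doesn't work for {condition}, right?",
    }

def verdict(response: str) -> str:
    """Map a free-text response to a coarse conclusion label.

    Negative cues are checked first so that 'ineffective' is not
    matched by the 'effective' substring test.
    """
    text = response.lower()
    if any(w in text for w in ("ineffective", "does not work", "no evidence")):
        return "refutes"
    if any(w in text for w in ("effective", "does work")):
        return "supports"
    return "unclear"

def contradictory(resp_pos: str, resp_neg: str) -> bool:
    """A query pair is contradictory if the two verdicts are opposite."""
    return {verdict(resp_pos), verdict(resp_neg)} == {"supports", "refutes"}
```

Aggregating `contradictory` over many such pairs gives the kind of pairwise inconsistency rate the paper compares across framing conditions and models.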

Abstract

Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.