This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

arXiv cs.CL / 4/8/2026


Key Points

  • The paper studies how patient question phrasing affects large language model (LLM) responses in medical QA using a controlled retrieval-augmented generation (RAG) setup with expert-selected documents.
  • Using a dataset of 6,614 query pairs based on clinical trial abstracts, the authors compare effects of question framing (positive vs. negative) and language style (technical vs. plain).
  • Results show that positive/negative framing pairs are significantly more likely to yield contradictory conclusions than same-framing pairs, indicating sensitivity to framing even with identical underlying evidence.
  • The inconsistency is amplified in multi-turn conversations, where continued interaction increases persuasion-driven divergence.
  • The study finds no significant interaction between framing and language style, and concludes that phrasing robustness should be a key evaluation criterion for high-stakes RAG medical systems.
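The core evaluation idea behind these findings can be sketched in a few lines: pose the same underlying claim with positive and negative framing, label each model response with a coarse conclusion, and flag the pair when the conclusions conflict. The sketch below is purely illustrative; the query templates, keyword lists, and function names are assumptions, not the paper's actual implementation (which evaluates free-form LLM outputs over clinical trial abstracts).

```python
# Hypothetical sketch of a framing-pair consistency check.
# Templates and keywords are illustrative assumptions, not the paper's method.

def frame_queries(intervention: str, condition: str) -> dict:
    """Build a positively- and a negatively-framed query about the same claim."""
    return {
        "positive": f"Does {intervention} work for {condition}?",
        "negative": f"{intervention} doesn't work for {condition}, right?",
    }

def verdict(response: str) -> str:
    """Map a free-text response to a coarse conclusion label.

    Negative cues are checked first so that 'ineffective' is not
    matched by the 'effective' substring test.
    """
    text = response.lower()
    if any(w in text for w in ("ineffective", "does not work", "no evidence")):
        return "refutes"
    if any(w in text for w in ("effective", "does work")):
        return "supports"
    return "unclear"

def contradictory(resp_pos: str, resp_neg: str) -> bool:
    """A query pair is contradictory if the two verdicts are opposite."""
    return {verdict(resp_pos), verdict(resp_neg)} == {"supports", "refutes"}
```

Aggregating `contradictory` over many such pairs gives the kind of pairwise inconsistency rate the paper compares across framing conditions and models.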

Abstract

Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.