This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA
arXiv cs.CL / 4/8/2026
Key Points
- The paper studies how patient question phrasing affects large language model (LLM) responses in medical QA using a controlled retrieval-augmented generation (RAG) setup with expert-selected documents.
- Using a dataset of 6,614 query pairs built from clinical trial abstracts, the authors compare the effects of question framing (positive vs. negative) and language style (technical vs. plain).
- Results show that positive/negative framing pairs are significantly more likely to yield contradictory conclusions than same-framing pairs, indicating sensitivity to framing even with identical underlying evidence.
- The inconsistency is amplified in multi-turn conversations, where continued interaction increases persuasion-driven divergence.
- The study finds no significant interaction between framing and language style, and concludes that phrasing robustness should be a key evaluation criterion for high-stakes RAG medical systems.
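The core measurement behind these findings can be illustrated with a minimal sketch. The paper's exact pipeline and label scheme are not given here, so the label set (`"effective"` / `"not_effective"` / `"unclear"`) and the pairing logic below are illustrative assumptions: given the model's conclusion for each member of a query pair, count how often the two conclusions directly contradict each other.

```python
def contradiction_rate(pairs):
    """Fraction of query pairs whose two answers reach opposite conclusions.

    Each pair is (answer_a, answer_b), where an answer is a label from a
    hypothetical set {"effective", "not_effective", "unclear"}. A pair counts
    as contradictory only when one answer is "effective" and the other is
    "not_effective"; "unclear" never contradicts anything.
    """
    contradictory = {"effective", "not_effective"}
    n_contra = sum(1 for a, b in pairs if a != b and {a, b} == contradictory)
    return n_contra / len(pairs) if pairs else 0.0

# Toy data: positive/negative framing pairs vs. same-framing pairs.
cross_framing = [
    ("effective", "not_effective"),
    ("effective", "effective"),
    ("not_effective", "effective"),
]
same_framing = [
    ("effective", "effective"),
    ("unclear", "unclear"),
    ("effective", "effective"),
]

print(contradiction_rate(cross_framing))  # 2 of 3 pairs contradict
print(contradiction_rate(same_framing))   # 0.0
```

Comparing the rate on cross-framing pairs against same-framing pairs, as in the toy data above, mirrors the paper's reported finding that opposite framings disagree significantly more often than identical framings over the same evidence.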