Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

arXiv cs.CL / 4/3/2026


Key Points

  • The study tests whether self-reflective (self-corrective) prompting can improve large language models’ accuracy on medical multiple-choice question answering beyond chain-of-thought (CoT) prompting.
  • Using GPT-4o and GPT-4o-mini, the authors compare standard CoT against an iterative self-reflection loop and observe how answers change across reflection steps on MedQA, HeadQA, and PubMedQA.
  • Results show self-reflection does not consistently raise accuracy; benefits are modest on MedQA but limited or even negative on HeadQA and PubMedQA.
  • Increasing the number of reflection steps does not reliably improve performance, indicating diminishing returns, with errors often persisting or new errors being introduced.
  • The paper concludes that self-reflective reasoning may be more useful for analyzing model behavior than as a dependable, safety-critical method to improve medical QA reliability.

Abstract

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning. Self-reflective (self-corrective) prompting, in which an LLM is asked to critique and revise its own reasoning, has been widely claimed to enhance model reliability, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection corrects errors, lets them persist, or introduces new ones. Our results show that self-reflective prompting does not consistently improve accuracy and that its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.
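The iterative self-reflection loop compared against CoT in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt wording, the stabilization stop criterion, and the `model` callable interface are hypothetical choices for exposition, not the paper's exact protocol; a toy stand-in model replaces the GPT-4o API call.

```python
# Hedged sketch of CoT prompting vs. an iterative self-reflection loop.
# Assumptions (not from the paper): prompt phrasing, early stopping when
# the answer stabilizes, and a `model(prompt) -> str` callable interface.

def cot_answer(model, question):
    """Single chain-of-thought pass: reason step by step, then answer."""
    prompt = f"{question}\nThink step by step, then give the final option letter."
    return model(prompt)

def self_reflect_loop(model, question, max_steps=3):
    """Iterative self-reflection: critique and possibly revise the previous answer.

    Returns the answer after every step, so one can track error correction,
    error persistence, and newly introduced errors across iterations.
    """
    history = [cot_answer(model, question)]
    for _ in range(max_steps):
        prompt = (
            f"{question}\n"
            f"Your previous answer was: {history[-1]}\n"
            "Reflect on whether the reasoning was correct. "
            "If you find an error, revise the answer; otherwise keep it."
        )
        revised = model(prompt)
        history.append(revised)
        if revised == history[-2]:  # answer stabilized; stop reflecting early
            break
    return history

# Toy stand-in model (purely illustrative, not an LLM): answers "A" on the
# first pass, then switches to "B" whenever shown its previous answer.
def toy_model(prompt):
    return "B" if "previous answer" in prompt else "A"

steps = self_reflect_loop(toy_model, "Which drug ...? (A) X (B) Y", max_steps=3)
print(steps)  # → ['A', 'B', 'B']
```

The returned trajectory makes the paper's three outcome categories directly observable: a change from wrong to right is error correction, an unchanged wrong answer is error persistence, and a change from right to wrong is a newly introduced error.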