Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

arXiv cs.CL / 4/3/2026


Key Points

  • The study tests whether self-reflective (self-corrective) prompting can improve large language models’ accuracy on medical multiple-choice question answering beyond chain-of-thought (CoT) prompting.
  • Using GPT-4o and GPT-4o-mini, the authors compare standard CoT against an iterative self-reflection loop and observe how answers change across reflection steps on MedQA, HeadQA, and PubMedQA.
  • Results show self-reflection does not consistently raise accuracy; benefits are modest on MedQA but limited or even negative on HeadQA and PubMedQA.
  • Increasing the number of reflection steps does not reliably improve performance, indicating diminishing returns, with errors often persisting or new errors being introduced.
  • The paper concludes that self-reflective reasoning may be more useful for analyzing model behavior than as a dependable, safety-critical method to improve medical QA reliability.

Abstract

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning. Self-reflective (self-corrective) prompting, in which an LLM is asked to critique and revise its own reasoning, has been widely claimed to enhance model reliability, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection corrects errors, lets them persist, or introduces new ones. Our results show that self-reflective prompting does not consistently improve accuracy and that its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.
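The iterative self-reflection loop compared against CoT in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt wording, the stabilization stop criterion, and the `model` callable interface are hypothetical choices for exposition, not the paper's exact protocol; a toy stand-in model replaces the GPT-4o API call.

```python
# Hedged sketch of CoT prompting vs. an iterative self-reflection loop.
# Assumptions (not from the paper): prompt phrasing, early stopping when
# the answer stabilizes, and a `model(prompt) -> str` callable interface.

def cot_answer(model, question):
    """Single chain-of-thought pass: reason step by step, then answer."""
    prompt = f"{question}\nThink step by step, then give the final option letter."
    return model(prompt)

def self_reflect_loop(model, question, max_steps=3):
    """Iterative self-reflection: critique and possibly revise the previous answer.

    Returns the answer after every step, so one can track error correction,
    error persistence, and newly introduced errors across iterations.
    """
    history = [cot_answer(model, question)]
    for _ in range(max_steps):
        prompt = (
            f"{question}\n"
            f"Your previous answer was: {history[-1]}\n"
            "Reflect on whether the reasoning was correct. "
            "If you find an error, revise the answer; otherwise keep it."
        )
        revised = model(prompt)
        history.append(revised)
        if revised == history[-2]:  # answer stabilized; stop reflecting early
            break
    return history

# Toy stand-in model (purely illustrative, not an LLM): answers "A" on the
# first pass, then switches to "B" whenever shown its previous answer.
def toy_model(prompt):
    return "B" if "previous answer" in prompt else "A"

steps = self_reflect_loop(toy_model, "Which drug ...? (A) X (B) Y", max_steps=3)
print(steps)  # → ['A', 'B', 'B']
```

The returned trajectory makes the paper's three outcome categories directly observable: a change from wrong to right is error correction, an unchanged wrong answer is error persistence, and a change from right to wrong is a newly introduced error.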