Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale

arXiv cs.CL / 4/28/2026


Key Points

  • The study directly compares domain fine-tuning versus retrieval-augmented generation (RAG) for medical multiple-choice question answering using a controlled 2×2 experimental design at the 4B-parameter scale.
  • It holds key variables constant (model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol) and varies only whether the model is domain-adapted (Gemma 3 4B vs. MedGemma 4B) and whether retrieved medical passages are inserted into the prompt.
  • On the MedQA-USMLE 4-option test split (1,273 questions, with answers aggregated by majority vote over three repeated calls per question), domain fine-tuning improves accuracy by +6.8 percentage points over the general 4B baseline (53.3% vs. 46.4%), with McNemar p < 10^-4; a sketch of this evaluation protocol follows the list.
  • Adding RAG (retrieved passages from a medical knowledge corpus of MedMCQA explanations) yields no statistically significant gain for either the general or the domain-tuned model, and for the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16).
  • The authors conclude that, at this scale and on this benchmark, domain knowledge encoded in model weights outperforms domain knowledge provided via context retrieval, and they release code and JSONL traces for replication.
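
The evaluation protocol summarized above is easy to sketch. The snippet below is a minimal illustration, not the authors' released code: it assumes per-question lists of sampled answers, aggregates them by majority vote, and runs an exact (binomial) McNemar test on the questions where exactly one of two systems is correct. All function and variable names are illustrative.

```python
from collections import Counter
from scipy.stats import binomtest

def majority_vote(answers):
    """Most common answer among the repeated calls for one question (3 per question in the paper)."""
    return Counter(answers).most_common(1)[0][0]

def accuracy(preds, gold):
    """Fraction of questions whose majority-vote answer matches the gold option."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def mcnemar_exact(preds_a, preds_b, gold):
    """Exact McNemar p-value from the discordant pairs (questions that only one system gets right)."""
    a_only = sum(pa == g and pb != g for pa, pb, g in zip(preds_a, preds_b, gold))
    b_only = sum(pa != g and pb == g for pa, pb, g in zip(preds_a, preds_b, gold))
    n = a_only + b_only
    return binomtest(a_only, n, p=0.5).pvalue if n else 1.0

# Hypothetical usage: runs_ft[q] / runs_base[q] hold the repeated sampled answers per question.
# preds_ft   = [majority_vote(runs_ft[q])   for q in questions]
# preds_base = [majority_vote(runs_base[q]) for q in questions]
# print(accuracy(preds_ft, gold), accuracy(preds_base, gold),
#       mcnemar_exact(preds_ft, preds_base, gold))
```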

Abstract

Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2×2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
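
To make the four cells of the 2×2 design concrete, the sketch below shows how a single prompt template can be sent to a general or domain-tuned 4B model served by Ollama, with retrieved passages optionally inserted as context. It is an illustration under stated assumptions rather than the released experiment code: the model tags, the retrieve() helper, the template wording, and the temperature value are all hypothetical.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local REST generate endpoint

def build_prompt(question, options, passages=None):
    """One shared template; the RAG cells differ only by prepending retrieved passages."""
    context = "Context:\n" + "\n".join(passages) + "\n\n" if passages else ""
    opts = "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    return (f"{context}Question: {question}\n{opts}\n"
            "Answer with the single letter of the best option.")

def ask(model, prompt, temperature=0.7):  # temperature held fixed across cells (value assumed)
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()

# The four cells differ only in the model tag and whether passages are supplied
# (model tags and retrieve() are illustrative placeholders):
# ask("gemma3:4b", build_prompt(q, opts))                # general, no RAG
# ask("gemma3:4b", build_prompt(q, opts, retrieve(q)))   # general + RAG
# ask("medgemma:4b", build_prompt(q, opts))              # domain-tuned, no RAG
# ask("medgemma:4b", build_prompt(q, opts, retrieve(q))) # domain-tuned + RAG
```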