Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
arXiv cs.CL / 4/28/2026
📰 News · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- The study directly compares domain fine-tuning versus retrieval-augmented generation (RAG) for medical multiple-choice question answering using a controlled 2×2 experimental design at the 4B-parameter scale.
- It holds key variables constant (model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol) and varies only whether the model is domain-adapted (Gemma 3 4B vs. MedGemma 4B) and whether retrieved medical passages are inserted into the prompt.
- On the MedQA-USMLE 4-option test split (1,273 questions, evaluated with majority vote across repeated calls), domain fine-tuning improves accuracy by +6.8 percentage points over the general 4B baseline (53.3% vs. 46.4%), with McNemar p < 10^-4.
- Adding retrieved passages from a medical knowledge corpus (RAG) yields no statistically significant gain for either the general or the domain-tuned model, and the point estimate for the domain-tuned model is slightly negative (-1.9 pp, p = 0.16).
- The authors conclude that, at this scale and on this benchmark, domain knowledge encoded in model weights outperforms domain knowledge provided via context retrieval, and they release code and JSONL traces for replication.
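The evaluation protocol above combines two simple pieces: a majority vote over repeated model calls per question, and an exact McNemar test on the paired right/wrong outcomes of the two models. A minimal sketch of both, in pure Python; the function names and signatures are illustrative, not taken from the paper's released code.

```python
from collections import Counter
from math import comb

def majority_vote(answers):
    """Pick the most frequent answer option across repeated model calls.

    Ties resolve to the option seen first (Counter preserves insertion order).
    """
    return Counter(answers).most_common(1)[0][0]

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on paired per-question outcomes.

    b = questions model A answered correctly and model B incorrectly,
    c = the reverse. Only these discordant pairs carry information;
    under the null, each discordant pair is a fair coin flip, so the
    p-value is a two-sided binomial tail probability.
    """
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```

For example, `majority_vote(["A", "B", "A"])` returns `"A"`, and a split of 15 vs. 5 discordant questions gives `mcnemar_exact(5, 15)` of roughly 0.041, i.e. significant at the conventional 0.05 level. The reported p < 10^-4 implies a far more lopsided discordant split on the 1,273-question test set.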