Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale

arXiv cs.CL / 4/28/2026


Key Points

  • The study directly compares domain fine-tuning versus retrieval-augmented generation (RAG) for medical multiple-choice question answering using a controlled 2×2 experimental design at the 4B-parameter scale.
  • It holds key variables constant (model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol) and varies only whether the model is domain-adapted (Gemma 3 4B vs. MedGemma 4B) and whether retrieved medical passages are inserted into the prompt.
  • On the MedQA-USMLE 4-option test split (1,273 questions, with answers aggregated by majority vote over three repeated calls per question), domain fine-tuning improves accuracy by +6.8 percentage points over the general 4B baseline (53.3% vs. 46.4%), with McNemar p < 10^-4; a sketch of this evaluation protocol follows the list.
  • Adding RAG (retrieved passages from a medical knowledge corpus of MedMCQA explanations) yields no statistically significant gain for either the general or the domain-tuned model, and for the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16).
  • The authors conclude that, at this scale and on this benchmark, domain knowledge encoded in model weights outperforms domain knowledge provided via context retrieval, and they release code and JSONL traces for replication.
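
The evaluation protocol summarized above is easy to sketch. The snippet below is a minimal illustration, not the authors' released code: it assumes per-question lists of sampled answers, aggregates them by majority vote, and runs an exact (binomial) McNemar test on the questions where exactly one of two systems is correct. All function and variable names are illustrative.

```python
from collections import Counter
from scipy.stats import binomtest

def majority_vote(answers):
    """Most common answer among the repeated calls for one question (3 per question in the paper)."""
    return Counter(answers).most_common(1)[0][0]

def accuracy(preds, gold):
    """Fraction of questions whose majority-vote answer matches the gold option."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def mcnemar_exact(preds_a, preds_b, gold):
    """Exact McNemar p-value from the discordant pairs (questions that only one system gets right)."""
    a_only = sum(pa == g and pb != g for pa, pb, g in zip(preds_a, preds_b, gold))
    b_only = sum(pa != g and pb == g for pa, pb, g in zip(preds_a, preds_b, gold))
    n = a_only + b_only
    return binomtest(a_only, n, p=0.5).pvalue if n else 1.0

# Hypothetical usage: runs_ft[q] / runs_base[q] hold the repeated sampled answers per question.
# preds_ft   = [majority_vote(runs_ft[q])   for q in questions]
# preds_base = [majority_vote(runs_base[q]) for q in questions]
# print(accuracy(preds_ft, gold), accuracy(preds_base, gold),
#       mcnemar_exact(preds_ft, preds_base, gold))
```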

Abstract

Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2×2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
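
To make the four cells of the 2×2 design concrete, the sketch below shows how a single prompt template can be sent to a general or domain-tuned 4B model served by Ollama, with retrieved passages optionally inserted as context. It is an illustration under stated assumptions rather than the released experiment code: the model tags, the retrieve() helper, the template wording, and the temperature value are all hypothetical.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local REST generate endpoint

def build_prompt(question, options, passages=None):
    """One shared template; the RAG cells differ only by prepending retrieved passages."""
    context = "Context:\n" + "\n".join(passages) + "\n\n" if passages else ""
    opts = "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    return (f"{context}Question: {question}\n{opts}\n"
            "Answer with the single letter of the best option.")

def ask(model, prompt, temperature=0.7):  # temperature held fixed across cells (value assumed)
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()

# The four cells differ only in the model tag and whether passages are supplied
# (model tags and retrieve() are illustrative placeholders):
# ask("gemma3:4b", build_prompt(q, opts))                # general, no RAG
# ask("gemma3:4b", build_prompt(q, opts, retrieve(q)))   # general + RAG
# ask("medgemma:4b", build_prompt(q, opts))              # domain-tuned, no RAG
# ask("medgemma:4b", build_prompt(q, opts, retrieve(q))) # domain-tuned + RAG
```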