Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

arXiv cs.AI / 5/1/2026

Key Points

  • The paper proposes MED-VRAG, an iterative multimodal RAG system for medical QA that retrieves and reasons over original PMC page images (not OCR text chunks).
  • MED-VRAG uses patch-level page embeddings and an offline coarse-to-fine index to keep Stage-1 retrieval fast (under 30 ms) while scaling to about 350K pages.
  • A vision-language model refines queries and accumulates evidence in a memory bank across up to 3 reasoning rounds, at about 15.9 seconds per iteration and 47.8 seconds for the full three-round pipeline on 4x A100 GPUs.
  • On four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG achieves 78.6% average accuracy; adding retrieval yields a +5.8 point gain over a no-retrieval baseline with the same backbone.
  • Ablations attribute +1.0 points to page-image retrieval over text-chunk retrieval, +1.5 to iterative refinement, and +1.0 to the memory bank, underscoring how multimodal evidence handling improves answer quality.

Abstract

Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4xA100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.
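
To make the coarse-to-fine index concrete, here is a minimal Python sketch of the described Stage-1 flow: C=8 k-means centroids stand in for each page's full patch matrix during the coarse pass, and only a top-R shortlist gets exact late-interaction rescoring. The function names are ours, the brute-force centroid scan stands in for a real ANN index, and reading "two-way scoring" as MaxSim in both the query→page and page→query directions is our assumption, not the paper's code.

```python
import numpy as np

def maxsim(a, b):
    """ColBERT-style late-interaction score: for each row of `a`, take its
    max similarity over rows of `b`, then sum. Embeddings are assumed
    L2-normalized, so the dot product equals cosine similarity."""
    return (a @ b.T).max(axis=1).sum()

def coarse_to_fine_search(q, centroids, centroid_to_page, page_patches, r=64, k=10):
    """Stage-1 retrieval sketch (hypothetical API, not the paper's code).

    q               : (n_q, d) query token embeddings
    centroids       : (n_pages * 8, d) C=8 k-means centroids per page, stacked
    centroid_to_page: (n_pages * 8,) page id for each centroid row
    page_patches    : list of (n_patches, d) arrays, one per page
    """
    # Coarse pass: score the query against centroids only. A production
    # system would use an ANN index (e.g., HNSW) over the centroids here.
    coarse = (q @ centroids.T).max(axis=0)            # best query match per centroid
    shortlist = np.unique(centroid_to_page[np.argsort(-coarse)[:r]])

    # Fine pass: exact "two-way" rescoring on the top-R shortlist
    # (assumed to mean MaxSim in both directions, summed).
    scored = [(pid, maxsim(q, page_patches[pid]) + maxsim(page_patches[pid], q))
              for pid in shortlist]
    scored.sort(key=lambda t: -t[1])
    return scored[:k]
```

Compressing hundreds of patch vectors per page into 8 centroids is what makes the coarse pass cheap, which is plausibly how Stage-1 stays under 30 ms at ~350K pages.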
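The sharded MapReduce LLM filter is described only at a high level; a plausible reading, sketched below, is that shortlisted pages are split into shards, an LLM judges relevance per shard in parallel (map), and the survivors are merged (reduce). The `is_relevant(question, page) -> bool` judge is a hypothetical interface, since the paper does not publish its prompt or API.

```python
from concurrent.futures import ThreadPoolExecutor

def mapreduce_filter(question, shards, is_relevant, max_workers=8):
    """Map: filter each shard of candidate pages in parallel with an LLM
    relevance judge (hypothetical `is_relevant` callable).
    Reduce: concatenate the surviving pages from all shards."""
    def map_shard(shard):
        return [page for page in shard if is_relevant(question, page)]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        kept = pool.map(map_shard, shards)            # map step, one task per shard
    return [page for shard_kept in kept for page in shard_kept]  # reduce step
```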
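Finally, the iterative reasoning loop with its memory bank reduces to the control flow below. The `retrieve`, `vlm.step`, and `vlm.final_answer` interfaces are our assumptions; the paper specifies only that the VLM refines its query, accumulates evidence across up to 3 rounds, and that each round costs ~15.9 s (3 × 15.9 s ≈ the reported 47.8 s pipeline total).

```python
def iterative_vrag(question, retrieve, vlm, max_rounds=3):
    """Control-flow sketch of the iterative loop (assumed interfaces:
    retrieve(query) -> pages; vlm.step(question, memory, pages) ->
    (answer_or_None, refined_query); vlm.final_answer(question, memory))."""
    memory = []                 # memory bank: evidence accumulated across rounds
    query = question            # round 1 retrieves with the raw question
    for _ in range(max_rounds):
        pages = retrieve(query)
        memory.extend(pages)                        # accumulate, never discard
        answer, query = vlm.step(question, memory, pages)
        if answer is not None:                      # VLM chose to stop early
            return answer
    return vlm.final_answer(question, memory)       # forced answer after last round
```

Per the ablations, both pieces of this loop matter: iterating at all adds +1.5 points, and keeping earlier rounds' evidence in the memory bank (rather than answering from the latest retrieval alone) adds a further +1.0.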