Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

arXiv cs.CV / 4/10/2026


Key Points

  • Fundus-R1 is a reasoning-enhanced multimodal LLM for fundus image reading, trained entirely on public datasets to reduce the reproducibility and access barriers of prior clinically paired training data.
  • The approach uses a RAG-based mechanism to automatically generate image-specific, knowledge-aware reasoning traces that connect visual findings to ophthalmic knowledge grounded in available labels.
  • To improve reasoning reliability, the paper enhances RLVR by adding a process reward that promotes self-consistency of the generated reasoning trace across rollouts.
  • Experiments on FunBench, Omni-Fundus, and GMAI-Fundus report that Fundus-R1 outperforms baselines, including a generic model (Qwen2.5-VL) and variants that were post-trained without the generated reasoning traces.
  • The work suggests a feasible pathway for building stronger fundus-reading MLLMs using public data rather than inaccessible in-house clinical samples.
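The self-consistency process reward described above can be sketched as follows. This is an illustrative interpretation, not the paper's actual formulation: here each rollout's final answer is compared against the majority answer across all rollouts for the same image, and agreeing rollouts receive a reward proportional to how dominant the majority is. The function name and scoring scheme are assumptions for illustration.

```python
# Hypothetical sketch of a self-consistency process reward for RLVR.
# Assumption: consistency is measured as agreement of each rollout's final
# answer with the majority answer across rollouts (not the paper's exact rule).
from collections import Counter

def self_consistency_reward(rollout_answers):
    """Return one reward in [0, 1] per rollout: rollouts whose final answer
    matches the majority answer score majority_fraction; others score 0."""
    counts = Counter(rollout_answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    n = len(rollout_answers)
    return [majority_count / n if a == majority_answer else 0.0
            for a in rollout_answers]

# Three of four rollouts agree on "glaucoma", so those three score 0.75.
rewards = self_consistency_reward(
    ["glaucoma", "glaucoma", "diabetic retinopathy", "glaucoma"])
```

In an RLVR setup, such a process reward would be added to the verifiable outcome reward (e.g., label correctness), nudging the policy toward reasoning traces that lead to stable conclusions across samples.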

Abstract

Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable number of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to a few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
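The RAG-based trace composition can be sketched as a toy pipeline: retrieve an ophthalmic knowledge snippet relevant to the image-level label, then stitch it together with the MLLM-described visual findings into a reasoning trace. The knowledge entries, the keyword-overlap retrieval, and all function names below are illustrative assumptions; the paper's actual retriever and knowledge base are not specified here.

```python
# Toy sketch of RAG-based, knowledge-aware reasoning-trace composition.
# Assumptions: a tiny hand-written knowledge base and word-overlap retrieval;
# the paper's real knowledge source and retriever may differ entirely.

KNOWLEDGE_BASE = {
    "diabetic retinopathy": ("Microaneurysms and dot-blot hemorrhages are "
                             "early signs of diabetic retinopathy."),
    "glaucoma": ("An enlarged cup-to-disc ratio suggests glaucomatous "
                 "optic neuropathy."),
    "macular degeneration": ("Drusen deposits in the macula are characteristic "
                             "of age-related macular degeneration."),
}

def retrieve(label: str) -> str:
    """Return the knowledge entry whose key shares the most words with the label."""
    def overlap(key: str) -> int:
        return len(set(key.split()) & set(label.lower().split()))
    return KNOWLEDGE_BASE[max(KNOWLEDGE_BASE, key=overlap)]

def compose_trace(findings: str, label: str) -> str:
    """Link MLLM-described visual findings to an image-level label
    via retrieved ophthalmic knowledge."""
    knowledge = retrieve(label)
    return (f"Findings: {findings}\n"
            f"Knowledge: {knowledge}\n"
            f"Conclusion: the findings support the label '{label}'.")

trace = compose_trace(
    "scattered microaneurysms in the posterior pole", "diabetic retinopathy")
```

A trace produced this way could then serve as an SFT target or as the reasoning prefix scored by the process reward during RLVR.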