Retrieval-Guided Generation for Safer Histopathology Image Captioning

arXiv cs.CV / 5/5/2026

Key Points

  • The paper proposes retrieval-guided generation (RGG) to make histopathology image captioning safer by grounding captions in expert text from visually similar cases rather than generating them fully from scratch.
  • Experiments on the ARCH histopathology dataset show improved semantic alignment with ground truth, with cosine similarity around 0.60 for RGG versus about 0.47 for MedGemma.
  • A pathologist-led qualitative review finds that RGG better preserves morphology-relevant terminology and produces fewer unsupported diagnostic claims than fully generative captioning.
  • The authors identify remaining failure modes for RGG, including concept mixing and the risk of inheriting overly specific labels from the retrieved sources.
  • Overall, the study argues that retrieval-guided captioning enables more transparent, auditable outputs compared with fully generative vision-language approaches in medical settings.
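The retrieval step behind these points can be illustrated with a minimal sketch. It assumes each archived case is stored as an image embedding alongside its expert caption, and that visual similarity is scored by cosine similarity between embeddings. The function names, the toy three-dimensional embeddings, and the captions are hypothetical and are not taken from the paper. The summarization of the retrieved text is left out.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical record layout: (case_id, image_embedding, expert_caption).
Case = Tuple[str, Sequence[float], str]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_captions(
    query: Sequence[float], index: List[Case], k: int = 2
) -> List[Tuple[str, float, str]]:
    """Return the k most similar cases as (case_id, score, caption), best first."""
    scored = [(cid, cosine(query, emb), cap) for cid, emb, cap in index]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy index with made-up embeddings and captions (illustration only).
index: List[Case] = [
    ("case-A", [1.0, 0.0, 0.0], "Glandular structures with nuclear atypia."),
    ("case-B", [0.7, 0.7, 0.0], "Irregular glands with hyperchromatic nuclei."),
    ("case-C", [0.0, 0.0, 1.0], "Dense lymphocytic infiltrate."),
]
query = [1.0, 0.05, 0.0]

hits = retrieve_captions(query, index, k=2)
for case_id, score, caption in hits:
    # Keeping the case id next to each caption is what makes the output auditable.
    print(f"{case_id}  {score:.3f}  {caption}")
```

Because every retrieved caption carries its source case id, a reviewer can trace any phrase in the final summary back to the case it came from, which is the auditing advantage the key points describe.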

Abstract

Generative vision-language models can produce fluent medical image captions but remain prone to hallucination, over-specific diagnostic claims, and factual inconsistency, all serious issues in pathology. We investigate retrieval-guided generation (RGG) as a safer alternative, where captions are formed by summarizing expert text from visually similar cases rather than generated de novo. On the ARCH histopathology dataset, RGG improves semantic alignment with ground truth, achieving cosine similarity of ≈0.60 versus ≈0.47 for MedGemma, with non-overlapping confidence intervals indicating a robust gain. A pathologist-led qualitative review shows better preservation of morphology-relevant terminology and fewer unsupported diagnoses, while revealing failure modes such as concept mixing and inherited over-specific labeling. Overall, retrieval-guided captioning offers a more transparent and reliable approach with clearer opportunities for auditing than fully generative methods.
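The abstract does not say how the confidence intervals were computed. As one plausible reading, the sketch below uses a percentile bootstrap over per-caption similarity scores and then checks whether two intervals overlap. The bootstrap choice, the function names, and the score lists are assumptions for illustration. The numbers are placeholders, not results from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_mean_ci(
    scores: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float]]:
    """Return the sample mean and a percentile-bootstrap (1 - alpha) interval for it."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n scores with replacement and record the resample mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lower, upper)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Placeholder per-caption cosine similarities for two hypothetical methods.
method_a = [0.71, 0.75, 0.73, 0.70, 0.76, 0.72, 0.74, 0.69]
method_b = [0.41, 0.45, 0.43, 0.40, 0.46, 0.42, 0.44, 0.39]

mean_a, ci_a = bootstrap_mean_ci(method_a)
mean_b, ci_b = bootstrap_mean_ci(method_b)
print(f"A: mean={mean_a:.3f}  CI=({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"B: mean={mean_b:.3f}  CI=({ci_b[0]:.3f}, {ci_b[1]:.3f})")
print("intervals overlap:", intervals_overlap(ci_a, ci_b))
```

Non-overlapping intervals are a conservative signal that the difference in means is not due to sampling noise. A paired test on per-image score differences would be a more direct comparison when both methods caption the same images.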