GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

arXiv cs.LG / 5/5/2026


Key Points

  • The GAZE framework enables medical vision-language models to work like radiologists by iteratively using viewer-level image tools (e.g., zoom, windowing, contrast, edge detection) and two literature/image retrieval tools (PubMed and Open-i).
  • GAZE emphasizes auditability and reliability by producing structured, schema-validated outputs and recording full tool-call traces for medical review and evaluation.
  • On the NOVA benchmark (906 brain MRI cases spanning 281 rare neurological conditions), GAZE achieves 58.2 mAP at IoU 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol, without task-specific fine-tuning.
  • The authors show that even before any tool use, structured prompting with schema validation improves performance over a Gemini 2.0 Flash baseline, indicating the framework design itself is a key experimental factor.
  • Tool use yields especially large gains for rare pathologies, while retrieval ablations reveal a model-dependent trade-off in which diagnostic improvements can coincide with localisation decreases, supporting joint evaluation of diagnosis, localisation, and captioning.
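The auditability claim rests on two mechanisms: schema validation of the model's structured output and a recorded trace of every tool call. A minimal sketch of both is below; the field names (`diagnosis`, `caption`, `bbox`) and the trace format are illustrative assumptions, not the paper's actual schema.

```python
import json

# Hypothetical output schema: required fields and their expected types for one
# case. These names are assumptions for illustration only.
CASE_SCHEMA = {
    "diagnosis": str,
    "caption": str,
    "bbox": list,  # [x_min, y_min, x_max, y_max] in pixel coordinates
}

def validate_case(output: dict) -> list:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for field, expected in CASE_SCHEMA.items():
        if field not in output:
            errors.append(f"missing field: {field}")
        elif not isinstance(output[field], expected):
            errors.append(f"bad type for {field}: {type(output[field]).__name__}")
    if not errors and len(output["bbox"]) != 4:
        errors.append("bbox must have exactly 4 coordinates")
    return errors

class ToolTrace:
    """Record every tool call so a reviewer can audit the full session."""
    def __init__(self):
        self.calls = []

    def log(self, tool: str, args: dict, result_summary: str):
        self.calls.append({"tool": tool, "args": args, "result": result_summary})

    def dump(self) -> str:
        return json.dumps(self.calls, indent=2)

trace = ToolTrace()
trace.log("zoom", {"factor": 2.0, "center": [128, 96]}, "cropped 256x256 patch")
output = {"diagnosis": "example condition",
          "caption": "T2 hyperintensity in the left hemisphere",
          "bbox": [40, 52, 88, 97]}
print(validate_case(output))  # → []
```

An output that fails validation can be rejected or retried before scoring, and the dumped trace shows exactly which viewer and retrieval tools were invoked per case.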

Abstract

Vision-language models (VLMs) read an image and produce text in a single forward pass, whereas radiologists typically inspect an image several times and consult the literature before writing a report. We introduce GAZE (Grounded Agentic Zero-shot Evaluation), a framework that lets a medical VLM work in this iterative way by calling viewer-level tools (zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine (PubMed for medical literature, Open-i for radiological images), with structured outputs validated against a schema and full tool-call traces recorded for auditability. On NOVA, a benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE reaches 58.2 mean average precision (mAP) at intersection-over-union (IoU) 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol that scores captioning, diagnosis, and localisation from the image alone, without task-specific fine-tuning. Before any tool is used, structured prompting and schema-validated outputs already improve over the published Gemini 2.0 Flash baseline (20.2 to 29.4 mAP@0.3), so framework design is itself an experimental variable. Tool use helps rare pathologies disproportionately: the fraction of cases with IoU > 0.3 rises from 17% to 58% for diagnoses with three or fewer examples versus 25% to 68% for common conditions (≥ 10 cases), with gains tracking engagement (Gemini 3 Flash: Cohen's d = 0.79, 11.8 tool calls per case; Gemini 2.0 Flash: tools used in 8.2% of cases, no significant benefit). Retrieval ablations additionally reveal a model-dependent trade-off in which gains in diagnosis can coincide with losses in localisation, reinforcing the case for joint evaluation of diagnosis, localisation, and captioning in medical VLMs.
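For reference, the localisation metric in the abstract scores a case by the overlap between a predicted and a ground-truth lesion box. A generic IoU computation is sketched below; the `[x_min, y_min, x_max, y_max]` box convention is an assumption, and the thresholding into "IoU > 0.3" hit rates follows directly from it.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are [x_min, y_min, x_max, y_max] in pixel coordinates
    (an assumed convention for this sketch).
    """
    # Coordinates of the intersection rectangle (empty if boxes don't overlap).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

pred = [30, 40, 90, 100]   # hypothetical predicted lesion box
gt = [50, 60, 110, 120]    # hypothetical ground-truth box
print(round(iou(pred, gt), 3))  # → 0.286, below the 0.3 threshold
```

Averaging such per-case hits over detection thresholds gives mAP@0.3; the "fraction of cases with IoU > 0.3" figures quoted above are the simpler per-case hit rate at that single threshold.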