MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

arXiv cs.CV / 3/27/2026


Key Points

  • The paper argues that current evaluations of medical vision-language models oversimplify clinical practice by using curated 2D images rather than requiring agents to explore full 3D, multi-sequence/multi-modality studies.
  • It proposes MEDOPENCLAW, an auditable runtime that enables VLM-based agents to operate dynamically inside standard medical viewers/tools such as 3D Slicer.
  • It introduces MEDFLOWBENCH, a full-study benchmark for multi-sequence brain MRI and lung CT/PET that compares agentic performance across viewer-only, tool-use, and open-method settings.
  • Initial results show a performance paradox: strong LLMs/VLMs can complete basic study navigation in viewer-only mode, but degrade when given access to professional support tools, which the authors attribute to a lack of precise spatial grounding.
  • The authors position MEDOPENCLAW and MEDFLOWBENCH as a reproducible foundation for building and evaluating auditable, interactive medical imaging agents.

Abstract

Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.
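The "auditable runtime" idea described above can be illustrated with a minimal sketch: every viewer action an agent takes (scrolling slices, switching sequences) is appended to a replayable log, so a reviewer can reconstruct exactly how the agent gathered evidence. This is a hypothetical illustration only, not MEDOPENCLAW's actual API; the class, method names, and sequence labels are invented for the example.

```python
import json


class AuditedViewer:
    """Hypothetical stand-in for a medical viewer backend (e.g., a 3D Slicer
    wrapper). Illustrates the auditable-runtime concept: an append-only action
    log that makes an agent's navigation replayable and reviewable."""

    def __init__(self, num_slices=64, sequences=("T1", "T2", "FLAIR")):
        self.num_slices = num_slices
        self.sequences = list(sequences)
        self.slice_idx = 0
        self.sequence = self.sequences[0]
        self.log = []  # append-only audit trail of (action, params, resulting state)

    def _record(self, action, **params):
        # Snapshot the post-action viewer state alongside the action itself,
        # so the trail can be replayed or audited step by step.
        self.log.append({
            "action": action,
            "params": params,
            "state": {"slice": self.slice_idx, "sequence": self.sequence},
        })

    def scroll_to(self, idx):
        """Move the viewer to a given slice, clamped to the volume bounds."""
        self.slice_idx = max(0, min(self.num_slices - 1, idx))
        self._record("scroll_to", idx=idx)

    def switch_sequence(self, name):
        """Switch the displayed MRI sequence, ignoring unknown names."""
        if name in self.sequences:
            self.sequence = name
        self._record("switch_sequence", name=name)

    def audit_json(self):
        """Serialize the full action trail for external review."""
        return json.dumps(self.log, indent=2)


# An agent exploring a study leaves a complete, inspectable trace:
viewer = AuditedViewer()
viewer.switch_sequence("FLAIR")
viewer.scroll_to(32)
```

The design choice worth noting is that the log records state *after* each action, not just the commands issued: this is what makes failures like the spatial-grounding degradation observed in the tool-use track diagnosable after the fact.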