MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Key Points

The work shifts evaluation from isolated grounding to workflow-aware sequential reasoning and provides code and data on GitHub for broader adoption.

Abstract

Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across independent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.

MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Key Points

Abstract

Related Articles

[D] Matryoshka Representation Learning

Two new Qwen3.5 “Neo” fine‑tunes focused on fast, efficient reasoning

HKIC, Gobi Partners and HKU team up for fund backing university research start-ups

Yann LeCun’s New LeWorldModel (LeWM) Research Targets JEPA Collapse in Pixel-Based Predictive World Modeling

Streaming experts

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Related Articles

[D] Matryoshka Representation Learning
Reddit r/MachineLearning

Two new Qwen3.5 “Neo” fine‑tunes focused on fast, efficient reasoning
Reddit r/LocalLLaMA

HKIC, Gobi Partners and HKU team up for fund backing university research start-ups
SCMP Tech

Yann LeCun’s New LeWorldModel (LeWM) Research Targets JEPA Collapse in Pixel-Based Predictive World Modeling
MarkTechPost

Streaming experts
Simon Willison's Blog