SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

arXiv cs.CV / 4/23/2026

Key Points

  • The paper introduces SurgCoT, a unified benchmark to evaluate chain-of-thought (CoT) spatiotemporal reasoning in multimodal LLMs using surgical videos across 7 specialties and 35 procedures.
  • SurgCoT measures five key reasoning dimensions: causal action ordering, cue-action alignment, affordance mapping, micro-transition localization, and anomaly onset tracking, using a structured CoT framework with an intensive annotation protocol.
  • The annotation design uses separate Knowledge and Clue fields to provide background context and definitive spatiotemporal evidence for each question.
  • Experiments with 10 leading MLLMs find that commercial models outperform open-source and medical-specialized variants, and that substantial gaps remain in surgical CoT reasoning.
  • The authors position SurgCoT as a reproducible testbed and pathway for narrowing the gap between current MLLM abilities and clinical reasoning requirements, with code released on GitHub.
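
The Question-Option-Knowledge-Clue-Answer protocol described in the key points can be pictured as a simple record type. The following is a minimal sketch in Python, not the benchmark's released schema: the class name `SurgCoTItem`, the field names and types, the integer answer index, and the placeholder sample values are all assumptions made for illustration. Only the roles of the five fields and the names of the five reasoning dimensions come from the paper summary.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The five reasoning dimensions named in the paper summary."""
    CAUSAL_ACTION_ORDERING = "Causal Action Ordering"
    CUE_ACTION_ALIGNMENT = "Cue-Action Alignment"
    AFFORDANCE_MAPPING = "Affordance Mapping"
    MICRO_TRANSITION_LOCALIZATION = "Micro-Transition Localization"
    ANOMALY_ONSET_TRACKING = "Anomaly Onset Tracking"


@dataclass(frozen=True)
class SurgCoTItem:
    """Hypothetical annotation record; field names are assumptions."""
    dimension: Dimension
    question: str
    options: tuple[str, ...]
    knowledge: str  # background context for the question
    clue: str       # spatiotemporal evidence supporting the answer
    answer: int     # index into `options` (assumed encoding)

    def __post_init__(self):
        # Reject records whose answer does not point at a listed option.
        if not 0 <= self.answer < len(self.options):
            raise ValueError("answer must index into options")


# Placeholder values only; these are not taken from the dataset.
example = SurgCoTItem(
    dimension=Dimension.CAUSAL_ACTION_ORDERING,
    question="<placeholder question>",
    options=("<option A>", "<option B>", "<option C>"),
    knowledge="<placeholder background context>",
    clue="<placeholder spatiotemporal evidence>",
    answer=1,
)
```

Keeping Knowledge and Clue as separate fields mirrors the distinction the paper draws between general background and the specific evidence in the video.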

Abstract

Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
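
To make the idea of comparing models across the five reasoning dimensions concrete, here is a small sketch of per-dimension scoring. The abstract does not state which metric the benchmark uses, so plain multiple-choice accuracy is an assumption here, as are the function name `per_dimension_accuracy` and the `(dimension, predicted, gold)` tuple layout. The data are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def per_dimension_accuracy(results):
    """Return {dimension: accuracy} for (dimension, predicted, gold) tuples.

    Accuracy is an assumed metric; the source does not specify one.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, gold in results:
        total[dimension] += 1
        correct[dimension] += int(predicted == gold)
    return {d: correct[d] / total[d] for d in total}


# Toy predictions for illustration only.
toy_results = [
    ("Causal Action Ordering", "A", "A"),
    ("Causal Action Ordering", "B", "C"),
    ("Anomaly Onset Tracking", "D", "D"),
]
scores = per_dimension_accuracy(toy_results)
```

Reporting one score per dimension, rather than a single aggregate, makes it easier to see where a model's surgical reasoning falls short.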