An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

arXiv cs.CV / 4/2/2026


Key Points

  • The paper argues that existing surgical vision-language datasets do not adequately capture fine-grained interleaved spatial-temporal dynamics needed for robust surgical video understanding by VLMs.
  • It introduces the SurgSTU-Pipeline, a deterministic dataset-generation approach that uses temporal and spatial continuity filtering to reduce reliance on costly manual labels or error-prone synthetic generation.
  • Using this pipeline on public surgical datasets, the authors build SurgSTU with 7,515 densely extended video clips and 150k fine-grained spatial-temporal question-answer samples.
  • Experiments show generalist VLMs perform poorly on spatial-temporal tasks in zero-shot mode, but improve with in-context learning.
  • A fine-tuned VLM trained on SurgSTU attains the best results across spatial-temporal tasks, and the authors plan to release the code publicly.

Abstract

Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotation or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7,515 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A VLM fine-tuned on the SurgSTU training split achieves the highest performance across all spatial-temporal tasks, validating the dataset's efficacy in improving spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.
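The paper does not detail how its temporal and spatial continuity filtering works, but the idea of segmenting annotated frames into clips that are both temporally dense and spatially smooth can be sketched deterministically. The following is a minimal illustration under assumed criteria (the function names, the `max_gap` frame-index threshold, and the `min_iou` box-overlap threshold are all hypothetical, not taken from the paper):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def continuity_filter(frames, max_gap=1, min_iou=0.5):
    """Split a sequence of annotated frames into clips that are
    temporally dense (index gaps <= max_gap) and spatially smooth
    (consecutive boxes overlap by at least min_iou).

    frames: list of (frame_index, box) sorted by frame_index.
    Returns a list of clips, each a list of (frame_index, box).
    """
    clips, current = [], []
    for idx, box in frames:
        if current:
            prev_idx, prev_box = current[-1]
            # Start a new clip on a temporal gap or a spatial jump.
            if idx - prev_idx > max_gap or iou(prev_box, box) < min_iou:
                clips.append(current)
                current = []
        current.append((idx, box))
    if current:
        clips.append(current)
    return clips
```

For example, a tool box that drifts slightly between frames 0 and 1 stays in one clip, while a four-frame gap or a box that teleports across the image starts a new clip. Being a pure rule-based pass over existing annotations, a filter of this kind is deterministic and avoids both manual labeling and LLM-generated noise, which matches the motivation the authors give for their pipeline.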