A vision-language model and platform for temporally mapping surgery from video

arXiv cs.CV / 3/25/2026


Key Points

  • This work proposes Halsted, a vision-language model for temporally mapping surgical activity from operative video, addressing the limitation of prior models that capture only a narrow set of behavioural components within a single procedure (a rough sketch of what such a temporal map might look like follows this list).
  • Halsted is trained on the Halsted Surgical Atlas (HSA), a large annotated video library grown through an iterative self-labelling framework and comprising over 650,000 videos across eight surgical specialties.
  • For benchmarking, the authors publicly release HSA-27k, a subset of the HSA; Halsted outperforms previous state-of-the-art models in mapping surgical activity while being more computationally efficient.
  • To close the "translational gap" that keeps surgical AI out of clinical practice, the authors built the Halsted web platform, which lets practising surgeons automatically map their own procedures within minutes.
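As a rough illustration only (the paper's actual output schema and phase taxonomy are not given here), a temporal map of a procedure can be thought of as a timestamped sequence of labelled segments. The phase names and field names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One labelled span of operative video (times in seconds)."""
    start: float
    end: float
    label: str  # e.g. a surgical phase, step, or action

# Hypothetical temporal map of a laparoscopic cholecystectomy;
# the phase names are illustrative, not Halsted's taxonomy.
temporal_map = [
    Segment(0.0,    312.5,  "port placement"),
    Segment(312.5,  1480.0, "dissection of Calot's triangle"),
    Segment(1480.0, 1755.0, "clipping and division"),
    Segment(1755.0, 2890.0, "gallbladder dissection"),
    Segment(2890.0, 3120.0, "extraction and closure"),
]

total = sum(s.end - s.start for s in temporal_map)
for s in temporal_map:
    print(f"{s.label}: {100 * (s.end - s.start) / total:.1f}% of the procedure")
```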

Abstract

Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries, grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (https://halstedhealth.ai/) to provide surgeons anywhere in the world with the previously unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.
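The abstract does not spell out how the self-labelling framework works, but iterative self-labelling (a form of self-training) follows a well-known pattern: a model trained on a seed of human annotations labels new videos, high-confidence pseudo-labels are kept (typically after some review), and the model is retrained on the grown set. Below is a minimal generic sketch under those assumptions; `train`, `predict`, the confidence threshold, and the round count are all hypothetical, not Halsted's actual pipeline:

```python
def grow_atlas(seed_labelled, unlabelled, train, predict,
               threshold=0.9, rounds=3):
    """Generic iterative self-labelling loop (a sketch, not Halsted's pipeline).

    seed_labelled: list of (video, label) pairs annotated by humans
    unlabelled:    list of videos without labels
    train:         fn(labelled) -> model
    predict:       fn(model, video) -> (label, confidence)
    """
    labelled = list(seed_labelled)
    for _ in range(rounds):
        model = train(labelled)
        remaining = []
        for video in unlabelled:
            label, confidence = predict(model, video)
            if confidence >= threshold:
                # In practice, pseudo-labels would typically be spot-checked
                # by annotators before joining the training set.
                labelled.append((video, label))
            else:
                remaining.append(video)
        unlabelled = remaining
    return train(labelled), labelled
```

The appeal of this loop is that annotation effort concentrates on the seed set and on reviewing borderline pseudo-labels, which is one plausible way a library could grow to hundreds of thousands of videos.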