AI Navigate

Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

arXiv cs.CV / March 16, 2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • Surg-R1 presents a three-level hierarchical reasoning framework for surgical vision-language modeling, enabling perceptual grounding, relational understanding, and contextual reasoning with interpretable outputs.
  • It introduces the largest surgical chain-of-thought dataset with 320,000 reasoning pairs and a four-stage training pipeline evolving from supervised fine-tuning through group-relative policy optimization to iterative self-improvement.
  • On SurgBench and six external multi-center datasets from five institutions, Surg-R1 achieves the highest Arena Score of 64.9%, outperforming Gemini 3.0 Pro and GPT-5.1.
  • The model outperforms proprietary reasoning models and specialized surgical VLMs across tasks such as instrument localization, triplet recognition, phase/action recognition, and safety assessment, with a 15.2 percentage point gain on external validation.
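To make the three-level hierarchy concrete, here is a minimal, purely illustrative sketch of what a structured, interpretable output along those levels could look like. The field names and example strings are assumptions for illustration, not the paper's actual output schema.

```python
# Illustrative only: a structured output mirroring the three-level reasoning
# hierarchy (perceptual grounding -> relational understanding -> contextual
# reasoning). Field names and contents are assumptions, not Surg-R1's schema.
from dataclasses import dataclass

@dataclass
class SurgicalReasoning:
    perceptual_grounding: list[str]      # level 1: what is visible (instruments, anatomy)
    relational_understanding: list[str]  # level 2: how entities interact (e.g. triplets)
    contextual_reasoning: str            # level 3: phase/safety judgment with justification

example = SurgicalReasoning(
    perceptual_grounding=["grasper at upper-left", "gallbladder exposed"],
    relational_understanding=["grasper retracts gallbladder"],
    contextual_reasoning=(
        "Consistent with the dissection phase; critical view of safety "
        "not yet achieved."
    ),
)
```

A structure like this is what would let a surgeon verify each reasoning level independently, rather than auditing a single opaque prediction.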

Abstract

Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks for lack of domain-specific knowledge. We present Surg-R1, a surgical vision-language model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy that decomposes surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset to date, with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning through group-relative policy optimization to iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, shows that Surg-R1 achieves the highest Arena Score on the public benchmarks (64.9%, versus 46.1% for Gemini 3.0 Pro and 37.9% for GPT-5.1). It outperforms both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical-view-of-safety assessment, and improves on the strongest surgical baseline by 15.2 percentage points on external validation.
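The group-relative policy optimization (GRPO) stage mentioned above centers on a simple idea: sample a group of candidate responses per prompt and score each one against the group's own statistics, removing the need for a separate value network. The following is a minimal sketch of that group-relative advantage computation under standard GRPO assumptions; the paper's actual reward design and hyperparameters are not specified in this summary.

```python
# Minimal sketch of group-relative advantage computation (the core of GRPO):
# each sampled response's reward is normalized against the mean and standard
# deviation of its own group of samples. Reward values below are made up.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Return per-sample advantages normalized within the group.

    eps guards against division by zero when all rewards in a group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for a group of 4 candidate reasoning chains on one query.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

By construction the advantages in a group sum to zero, so above-average reasoning chains are reinforced and below-average ones are penalized relative to their peers.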