Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy

arXiv cs.CV / April 13, 2026


Key Points

  • The paper introduces a fine-grained action segmentation task focused on renorrhaphy during robot-assisted partial nephrectomy, emphasizing frame-level recognition of visually similar suturing gestures with variable durations and heavy class imbalance.
  • It proposes the SIA-RAPN benchmark, built from 50 da Vinci Xi clinical videos annotated with 12 frame-level labels, and releases split configurations that enable standardized comparison of temporal segmentation models.
  • Four temporal models based on I3D features are compared—MS-TCN++, AsFormer, TUT, and DiffAct—using metrics such as balanced accuracy, edit score, segmental F1 at multiple IoU thresholds, and frame-wise accuracy/mAP.
  • On the primary benchmark, DiffAct achieves the strongest overall performance (highest segmental F1, frame-wise accuracy, edit score, and frame-wise mAP), while MS-TCN++ leads specifically on balanced accuracy.
  • The benchmark also includes cross-domain evaluation on a separate single-port RAPN dataset, assessing generalization beyond the primary da Vinci Xi setting.
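
The segmental F1 metric listed above is the standard one from the temporal action segmentation literature: a predicted segment counts as a true positive when it overlaps an unmatched same-class ground-truth segment with IoU at or above the threshold. The sketch below illustrates that convention; matching details (e.g. tie-breaking) may differ from the benchmark's actual evaluation script.

```python
from itertools import groupby

def segments(frames):
    """Collapse a frame-label sequence into (label, start, end) segments."""
    segs, i = [], 0
    for label, run in groupby(frames):
        n = len(list(run))
        segs.append((label, i, i + n))  # end index is exclusive
        i += n
    return segs

def f1_at_iou(pred, gt, thresh):
    """Segmental F1 at an IoU overlap threshold given as a fraction (e.g. 0.5)."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)  # each GT segment may be matched once
    tp = 0
    for pl, ps, pe in p_segs:
        best, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != pl or used[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best:
                best, best_j = iou, j
        if best >= thresh and best_j >= 0:
            tp += 1
            used[best_j] = True
    fp = len(p_segs) - tp
    fn = len(g_segs) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```

Because over-segmentation inflates false positives at every threshold, this metric penalizes fragmented predictions in a way that plain frame-wise accuracy does not, which is why the benchmark reports both.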

Abstract

Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Taking the best reported values across those five runs on the primary dataset, DiffAct achieves the highest segmental F1, frame-wise accuracy, edit score, and frame-wise mAP, while MS-TCN++ attains the highest balanced accuracy.
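
The edit score in the evaluation protocol is conventionally a normalized Levenshtein distance over the ordered segment labels, so it rewards predicting the correct sequence of gestures regardless of how long each segment is. A minimal sketch of that convention (the paper's exact implementation may differ):

```python
from itertools import groupby

def edit_score(pred, gt):
    """Segmental edit score in [0, 100]: 100 minus the normalized
    Levenshtein distance between the ordered segment-label sequences."""
    p = [label for label, _ in groupby(pred)]  # collapse frames to segment labels
    g = [label for label, _ in groupby(gt)]
    m, n = len(p), len(g)
    # Standard dynamic-programming edit distance over the two label sequences.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return (1 - d[m][n] / max(m, n, 1)) * 100
```

Under this convention a prediction with the right gesture order but wrong segment boundaries still scores 100, which makes edit score a useful complement to the IoU-thresholded segmental F1.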