AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

arXiv cs.CV / 4/3/2026


Key Points

  • AffordTissue is a new multimodal framework for predicting tool-action-specific safe tissue interaction regions in surgical settings, outputting dense affordance heatmaps for cholecystectomy.
  • The method combines a temporal vision encoder (capturing tool motion and tissue dynamics), language conditioning (to generalize across instrument-action pairs), and a DiT-style decoder for dense affordance prediction; a minimal sketch of this design follows the list.
  • The paper introduces the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six tool-action pairs and four instruments.
  • Experiments report substantially better dense prediction accuracy than vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), suggesting task-specific architectures outperform general foundation models for this dense spatial reasoning task.
  • By explicitly localizing where instruments should interact safely, AffordTissue aims to improve surgical automation predictability and could enable policy guidance and early safe-stop when actions deviate from predicted zones.
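
To make the three-component design concrete, here is a minimal PyTorch sketch of one way such a pipeline could be wired together. Everything in it is an illustrative assumption rather than the paper's implementation: the AffordanceNet name, the token and embedding sizes, the plain transformer over spatio-temporal patch tokens standing in for the temporal vision encoder, a pre-computed text embedding standing in for the language conditioning, and an adaptive-LayerNorm-style modulation standing in for the DiT-style decoder.

```python
# Illustrative sketch only; module names, dimensions, and the adaLN-style
# conditioning are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class AffordanceNet(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, text_dim=256, frames=4):
        super().__init__()
        self.patch = patch
        n_tokens = (img_size // patch) ** 2
        # Temporal vision encoder: per-frame patch embedding, then a
        # transformer over the flattened (frames x patches) token sequence,
        # so attention can capture tool motion and tissue dynamics.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, frames * n_tokens, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Language conditioning: a learned projection of a pre-computed text
        # embedding for the tool-action prompt, e.g. "grasp with grasper".
        self.text_proj = nn.Linear(text_dim, dim)
        # DiT-style decoder stand-in: transformer blocks whose tokens are
        # scaled and shifted by the conditioning vector (adaLN-like).
        self.ada = nn.Linear(dim, 2 * dim)  # per-token scale and shift
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.head = nn.Linear(dim, patch * patch)  # per-patch heatmap logits

    def forward(self, clip, text_emb):
        # clip: (B, T, 3, H, W) video frames; text_emb: (B, text_dim)
        B, T, C, H, W = clip.shape
        x = self.patch_embed(clip.flatten(0, 1))          # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2)                  # (B*T, N, dim)
        x = x.reshape(B, T * x.shape[1], -1) + self.pos   # spatio-temporal tokens
        x = self.encoder(x)
        # Keep only the last frame's spatial tokens for dense decoding.
        n = x.shape[1] // T
        x = x[:, -n:, :]
        # adaLN-style conditioning on the projected text embedding.
        scale, shift = self.ada(self.text_proj(text_emb)).chunk(2, dim=-1)
        x = x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        x = self.decoder(x)
        logits = self.head(x)                             # (B, N, patch*patch)
        h = H // self.patch
        heatmap = logits.reshape(B, h, h, self.patch, self.patch)
        heatmap = heatmap.permute(0, 1, 3, 2, 4).reshape(B, H, W)
        return torch.sigmoid(heatmap)                     # dense affordance map


# Smoke test with random stand-ins for the video clip and text features.
model = AffordanceNet()
clip = torch.randn(2, 4, 3, 224, 224)  # (batch, frames, rgb, H, W)
text = torch.randn(2, 256)             # e.g. a frozen text encoder's output
print(model(clip, text).shape)         # torch.Size([2, 224, 224])
```

The key design point the sketch tries to capture is that the language prompt conditions the dense decoder directly, so the same network can produce different safe-interaction heatmaps for different tool-action pairs on the same frames.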

Abstract

Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these approaches have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability about where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action-specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action-specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe-stop when instruments deviate outside predicted safe zones.
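
The 20.6 px figure refers to Average Symmetric Surface Distance (ASSD) between the predicted and ground-truth affordance regions. Below is a small, self-contained sketch of how ASSD can be computed from two heatmaps; the 0.5 binarization threshold and the boundary-extraction choice are assumptions, since the paper's exact evaluation protocol is not shown here.

```python
# Hedged sketch of ASSD in pixels; threshold and boundary handling are
# assumptions, not the paper's evaluation code.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def boundary(mask: np.ndarray) -> np.ndarray:
    """Pixels on the edge of a binary region (mask minus its erosion)."""
    return mask & ~binary_erosion(mask)


def assd(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5) -> float:
    """Average symmetric surface distance between two heatmaps, in pixels."""
    p, g = boundary(pred >= thr), boundary(gt >= thr)
    if not p.any() or not g.any():
        return float("inf")  # degenerate case: one region is empty
    # For each boundary pixel, distance to the nearest pixel of the other
    # boundary, via Euclidean distance transforms of the complements.
    d_to_g = distance_transform_edt(~g)
    d_to_p = distance_transform_edt(~p)
    return float((d_to_g[p].sum() + d_to_p[g].sum()) / (p.sum() + g.sum()))
```

Lower is better: the symmetric average penalizes a prediction whose boundary drifts far from the annotation in either direction, which is why the gap between 20.6 px and 60.2 px indicates a much tighter spatial fit.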
