Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery
arXiv cs.CV / 3/18/2026
Key Points
- The article introduces a speech-guided embodied-agent framework for video-guided skull-base surgery that responds to the surgeon's queries.
- It combines natural language interaction with real-time visual perception on live intraoperative video streams, eliminating the need for external optical trackers.
- The system first interactively segments and labels the surgical instrument, then uses the segmented instrument as a spatial anchor for downstream tasks such as anatomical segmentation, registration, tool pose estimation, and real-time overlays.
- Evaluation shows competitive spatial accuracy compared with a commercial optical tracking system and highlights improved workflow integration and potential for rapid deployment of video-guided surgical systems.
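
As a rough illustration of what comparing spatial accuracy against an optical tracker can involve, the sketch below computes the distance between paired tool-tip positions from two sources. It is not the paper's evaluation protocol. The function names (`tip_errors`, `summarize`) and the coordinates are made up for illustration, and the sketch assumes both sets of positions are already expressed in the same coordinate frame and units.

```python
import math

def tip_errors(video_tips, tracker_tips):
    """Euclidean distance between paired 3D tip positions (same units as input)."""
    if len(video_tips) != len(tracker_tips):
        raise ValueError("video and tracker samples must be paired one-to-one")
    return [math.dist(v, t) for v, t in zip(video_tips, tracker_tips)]

def summarize(errors):
    """Mean, root-mean-square, and maximum of a non-empty list of errors."""
    n = len(errors)
    return {
        "mean": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "max": max(errors),
    }

# Hypothetical coordinates in millimetres, NOT data from the paper.
video = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]      # video-based tip estimates
tracker = [(3.0, 4.0, 0.0), (10.0, 0.0, 1.0)]    # optical-tracker reference

stats = summarize(tip_errors(video, tracker))
print(stats)
```

In practice the two systems report positions in different frames, so a registration step would have to precede this comparison. The sketch leaves that out.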




