A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

arXiv cs.RO / 23 April 2026


Key Points

  • The paper proposes a Vision-Language-Action (VLA) model to unify robotic ultrasound (RUS)-based needle tracking and adaptive needle insertion under dynamic imaging conditions.
  • It introduces a Cross-Depth Fusion (CDF) tracking head that combines shallow positional signals with deep semantic features to support real-time end-to-end tracking.
  • To adapt a large pretrained vision backbone for tracking efficiently, the authors add a Tracking-Conditioning (TraCon) register for parameter-efficient feature conditioning.
  • For insertion control, the method uses an uncertainty-aware control policy plus an asynchronous VLA pipeline to enable timely, safer decisions.
  • Experiments report consistent improvements over state-of-the-art trackers and manual operation, including higher tracking accuracy, better insertion success rates, and shorter procedure time.
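The Cross-Depth Fusion idea in the second bullet can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the token count, feature dimensions, additive projection-based fusion, and the soft-argmax tracking head are plausible readings of "integrating shallow positional and deep semantic features", not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a ViT-style backbone with a 14x14 patch grid.
num_tokens, shallow_dim, deep_dim, fused_dim = 196, 768, 768, 256

def cross_depth_fusion(shallow, deep, w_shallow, w_deep):
    """Fuse shallow (positional) and deep (semantic) token features.

    One plausible reading of a CDF head: project both feature levels
    into a shared space, combine them additively, and pass the result
    to a lightweight tracking head.
    """
    fused = shallow @ w_shallow + deep @ w_deep   # (tokens, fused_dim)
    return np.tanh(fused)                          # bounded activation

# Toy features standing in for backbone activations (illustration only).
shallow_feats = rng.standard_normal((num_tokens, shallow_dim))
deep_feats = rng.standard_normal((num_tokens, deep_dim))
w_s = rng.standard_normal((shallow_dim, fused_dim)) * 0.02
w_d = rng.standard_normal((deep_dim, fused_dim)) * 0.02

fused = cross_depth_fusion(shallow_feats, deep_feats, w_s, w_d)

# A tracking head could then score each token and take a soft-argmax
# over patch-grid positions to produce a 2-D needle-tip estimate.
scores = fused.sum(axis=1)                         # (tokens,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
grid = np.stack(np.meshgrid(np.arange(14), np.arange(14)), -1).reshape(-1, 2)
tip_xy = (weights[:, None] * grid).sum(axis=0)     # expected tip location

print(fused.shape, tip_xy)
```

The soft-argmax keeps the head fully differentiable, which is consistent with the paper's end-to-end tracking claim, though the real head design is not specified in this summary.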

Abstract

Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the large-scale vision backbone. To adapt the pretrained vision backbone for tracking tasks, a Tracking-Conditioning (TraCon) register is introduced for parameter-efficient feature conditioning. After needle tracking, an uncertainty-aware control policy and an asynchronous VLA pipeline are presented for adaptive needle insertion control, ensuring timely decision-making for improved safety and outcomes. Extensive experiments on both needle tracking and insertion show that our method consistently outperforms state-of-the-art trackers and manual operation, achieving higher tracking accuracy, improved insertion success rates, and reduced procedure time, highlighting promising directions for RUS-based intelligent intervention.
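The uncertainty-aware control policy described above can be sketched as a simple gating rule. This is a hypothetical illustration: the thresholds, the spread-of-estimates uncertainty proxy, and the pause/slow/advance actions are assumptions, since the abstract does not specify how uncertainty modulates insertion.

```python
import statistics

# Hypothetical thresholds; the paper's actual policy is not given here.
UNCERTAINTY_PAUSE = 0.6   # above this spread, hold the needle and rescan
UNCERTAINTY_SLOW = 0.3    # above this spread, reduce insertion speed

def insertion_command(tip_estimates_mm, base_step_mm=1.0):
    """Map recent needle-tip estimates to the next insertion step.

    Uncertainty is proxied by the spread of recent tip estimates; a
    real system would likely use the tracker's own confidence output.
    """
    spread = statistics.pstdev(tip_estimates_mm)
    if spread > UNCERTAINTY_PAUSE:
        return {"action": "pause", "step_mm": 0.0, "uncertainty": spread}
    if spread > UNCERTAINTY_SLOW:
        return {"action": "advance", "step_mm": base_step_mm * 0.25,
                "uncertainty": spread}
    return {"action": "advance", "step_mm": base_step_mm,
            "uncertainty": spread}

print(insertion_command([10.0, 10.1, 9.9]))   # low spread: full step
print(insertion_command([10.0, 11.5, 8.4]))   # high spread: pause
```

The asynchronous VLA pipeline mentioned in the abstract would sit around such a policy, letting the slower VLA reasoning run off the control loop's critical path so that gating decisions like these remain timely.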