Instrument-Splatting++: Towards Controllable Surgical Instrument Digital Twin Using Gaussian Splatting

arXiv cs.RO / 3/25/2026


Key Points

  • Instrument-Splatting++ is presented as a monocular 3D Gaussian Splatting framework to build high-fidelity, controllable digital twins of surgical instruments for Real2Sim in robot-assisted surgery.
  • The method uses part-wise geometry pretraining that injects CAD priors into Gaussian primitives and equips the representation with part-aware semantic rendering, making reconstructions controllable part by part (see the first sketch after this list).
  • It proposes SAPET (semantics-aware pose estimation and tracking) to recover per-frame 6-DoF instrument pose and joint angles from unposed endoscopic videos, supervised by a gripper-tip network trained purely on synthetic semantics and constrained by a loose regularization that suppresses singular articulations (second sketch below).
  • Robust Texture Learning (RTL) alternates pose refinement with robust appearance optimization, reducing the impact of pose noise during texture learning (third sketch below).
  • Experiments on EndoVis17/18, SAR-RARP, and an in-house dataset show improved photometric quality and geometric accuracy over prior baselines, and unseen-pose data augmentation from the controllable Gaussian instrument model improves a downstream keypoint detection task.
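
The part-aware semantic rendering in the second point can be pictured as standard 3DGS alpha compositing applied to part labels: each Gaussian carries a one-hot part label inherited from the CAD prior, and the label is composited front-to-back exactly like color. Below is a minimal single-pixel sketch; the function name, tensor shapes, and the three-part split (shaft, wrist, gripper) are illustrative assumptions, not the paper's implementation.

```python
import torch

# Hypothetical single-pixel sketch: each Gaussian contributes an opacity
# alpha_i (after projection, sorted front-to-back) and a one-hot part label
# s_i; the semantic map is alpha-composited exactly like color in 3DGS:
#   S = sum_i T_i * alpha_i * s_i,  with  T_i = prod_{j<i} (1 - alpha_j)
def composite_semantics(alphas: torch.Tensor, part_onehot: torch.Tensor) -> torch.Tensor:
    """alphas: (N,) opacities, front-to-back; part_onehot: (N, P) part labels.
    Returns (P,) composited part probabilities for the pixel."""
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0)  # T_i
    weights = transmittance * alphas                                # w_i = T_i * alpha_i
    return (weights.unsqueeze(-1) * part_onehot).sum(dim=0)

# toy usage: 4 Gaussians over 3 assumed parts (shaft, wrist, gripper)
alphas = torch.tensor([0.6, 0.3, 0.8, 0.5])
labels = torch.eye(3)[torch.tensor([0, 1, 2, 1])]
print(composite_semantics(alphas, labels))  # per-part visibility at this pixel
```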
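SAPET, per the summary, fits per-frame 6-DoF pose and joint angles using 2D supervision from the gripper-tip network plus a loose joint regularizer. The sketch below is a heavily simplified stand-in: the two-jaw gripper kinematics, camera intrinsics, detected tip coordinates, and loss weights are all assumed for illustration and are not the paper's model.

```python
import torch

def axis_angle_to_matrix(r):
    # Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2
    theta = r.norm() + 1e-8
    k = r / theta
    K = torch.stack([torch.stack([torch.zeros(()), -k[2], k[1]]),
                     torch.stack([k[2], torch.zeros(()), -k[0]]),
                     torch.stack([-k[1], k[0], torch.zeros(())])])
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def tips_from_joints(q):
    # assumed toy kinematics: two 1 cm jaws swinging by q around a wrist
    # 5 cm along the shaft, expressed in the instrument frame
    base = torch.tensor([0.0, 0.0, 0.05])
    offsets = torch.stack([
        torch.stack([torch.sin(q[0]), torch.zeros(()), torch.cos(q[0])]),
        torch.stack([-torch.sin(q[1]), torch.zeros(()), torch.cos(q[1])])]) * 0.01
    return base + offsets  # (2, 3) tip points

def project(X, f=1000.0, cx=320.0, cy=240.0):
    # pinhole projection with assumed intrinsics
    return torch.stack([f * X[:, 0] / X[:, 2] + cx,
                        f * X[:, 1] / X[:, 2] + cy], dim=-1)

detected = torch.tensor([[330.0, 242.0], [312.0, 239.0]])  # tip-net output (assumed)
r = torch.full((3,), 1e-2, requires_grad=True)             # axis-angle rotation
t = torch.tensor([0.0, 0.0, 0.3], requires_grad=True)      # translation (m)
q = torch.full((2,), 0.2, requires_grad=True)              # jaw angles (rad)
opt = torch.optim.Adam([r, t, q], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    tips_cam = tips_from_joints(q) @ axis_angle_to_matrix(r).T + t
    # reprojection error + loose L2 regularizer keeping joints from extremes
    loss = (project(tips_cam) - detected).pow(2).mean() + 1e-3 * q.pow(2).sum()
    loss.backward()
    opt.step()
print(f"fitted jaw angles (rad): {q.detach().tolist()}")
```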
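RTL's alternation can be sketched as a two-phase schedule: pose-refinement phases with appearance frozen, and appearance phases under a robust photometric loss that bounds the influence of outlier pixels left by residual pose error. The Huber-style loss and the toy differentiable renderer below are assumptions for illustration, not the paper's exact choices.

```python
import torch

def robust_photometric(pred, target, delta=0.1):
    # Huber-style kernel: quadratic for small residuals, linear beyond delta,
    # so outlier pixels contribute a bounded gradient
    r = (pred - target).abs()
    return torch.where(r < delta, 0.5 * r.pow(2) / delta, r - 0.5 * delta).mean()

def render(texture, pose):
    # toy differentiable stand-in for splatting under a refined pose: the
    # scalar "pose" acts as a global mis-registration bias, not a real camera
    return texture + pose

target = torch.rand(64, 64)
target[::8, ::8] = 5.0                    # planted outliers (e.g. specular pixels)
texture = torch.zeros(64, 64, requires_grad=True)
pose = torch.tensor(0.3, requires_grad=True)

for phase in range(4):
    # even phases refine pose with texture frozen; odd phases fit texture
    params = [pose] if phase % 2 == 0 else [texture]
    opt = torch.optim.Adam(params, lr=5e-2)
    for _ in range(100):
        opt.zero_grad()
        robust_photometric(render(texture, pose), target).backward()
        opt.step()
```

With a plain L2 loss the planted outliers would dominate the fit; the robust kernel caps their gradient, which is the intuition behind letting texture learning tolerate imperfect poses.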

Abstract

High-quality and controllable digital twins of surgical instruments are critical for Real2Sim in robot-assisted surgery, as they enable realistic simulation, synthetic data generation, and perception learning under novel poses. We present Instrument-Splatting++, a monocular 3D Gaussian Splatting (3DGS) framework that reconstructs a surgical instrument as a fully controllable, high-fidelity Gaussian asset. Our pipeline starts with part-wise geometry pretraining that injects CAD priors into Gaussian primitives and equips the representation with part-aware semantic rendering. Building on the pretrained model, we propose a semantics-aware pose estimation and tracking (SAPET) method to recover per-frame 6-DoF pose and joint angles from unposed endoscopic videos, where a gripper-tip network trained purely from synthetic semantics provides robust supervision and a loose regularization suppresses singular articulations. Finally, we introduce Robust Texture Learning (RTL), which alternates pose refinement with robust appearance optimization to mitigate pose noise during texture learning. The resulting framework performs pose estimation and learns realistic texture directly from unposed videos. We validate our method on sequences extracted from EndoVis17/18, SAR-RARP, and an in-house dataset, showing superior photometric quality and improved geometric accuracy over state-of-the-art baselines. We further demonstrate a downstream keypoint detection task where unseen-pose data augmentation from our controllable instrument Gaussian improves performance.