Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion

arXiv cs.RO / 3/25/2026


Key Points

  • The paper addresses why precise robotic insertion is hard in real environments, citing contact-rich dynamics, tight clearances, and limited demonstration data as key bottlenecks for existing visuomotor imitation learning approaches.
  • It introduces pose-guided imitation learning that uses compact, object-centric relative SE(3) poses and applies a diffusion policy to predict future relative pose trajectories as actions for insertion.
  • To handle pose estimation noise, the method augments pose features with goal-conditioned RGBD encoding and uses a pose-guided residual gated fusion module that lets RGBD cues compensate when pose estimates are unreliable.
  • Experiments on six real-robot precise insertion tasks show strong performance with only 7–10 demonstrations per task, including successful operation at clearances as low as 0.01 mm, with better data efficiency and generalization than baselines.
  • The authors report that code will be released via the provided GitHub link, supporting reproducibility and further research on pose- and diffusion-based insertion policies.

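The core observation in the second bullet is a *relative* pose: the pose of the source object expressed in the target object's frame, obtained by composing the two world-frame poses. A minimal sketch with 4×4 homogeneous transforms (the helper names and example numbers are illustrative, not from the paper):

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 homogeneous SE(3) transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_source, T_target):
    """Pose of the source object in the target object's frame:
    T_rel = T_target^{-1} @ T_source."""
    return np.linalg.inv(T_target) @ T_source

# Example: source 5 cm above the target along z, same orientation.
T_source = make_pose(np.eye(3), [0.30, 0.10, 0.25])
T_target = make_pose(np.eye(3), [0.30, 0.10, 0.20])
T_rel = relative_pose(T_source, T_target)
```

Because the observation is relative, a policy trained this way is invariant to where the source/target pair sits in the workspace, which is what underlies the claimed generalization under pose variations.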
Abstract

Imitation learning is promising for robotic manipulation, but *precise insertion* in the real world remains difficult due to contact-rich dynamics, tight clearances, and limited demonstrations. Many existing visuomotor policies depend on high-dimensional RGB/point-cloud observations, which can be data-inefficient and generalize poorly under pose variations. In this paper, we study pose-guided imitation learning by using object poses in SE(3) as compact, object-centric observations for precise insertion tasks. First, we propose a diffusion policy for precise insertion that observes the *relative* SE(3) pose of the source object with respect to the target object and predicts a future relative pose trajectory as its action. Second, to improve robustness to pose estimation noise, we augment the pose-guided policy with RGBD cues. Specifically, we introduce a goal-conditioned RGBD encoder to capture the discrepancy between current and goal observations. We further propose a pose-guided residual gated fusion module, where pose features provide the primary control signal and RGBD features adaptively compensate when pose estimates are unreliable. We evaluate our methods on six real-robot precise insertion tasks and achieve high performance with only 7–10 demonstrations per task. In our setup, the proposed policies succeed on tasks with clearances down to 0.01 mm and demonstrate improved data efficiency and generalization over existing baselines. Code will be available at https://github.com/sunhan1997/PoseInsert.
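The residual gated fusion described above can be pictured as a learned gate deciding how much RGBD information to add on top of the pose features. The following is a toy numpy sketch under assumed shapes and names (the paper's actual module, dimensions, and gate parameterization may differ; weights here are randomly initialized purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PoseGuidedResidualGatedFusion:
    """Illustrative sketch: pose features carry the primary control signal,
    and a sigmoid gate (conditioned on both modalities) scales a residual
    RGBD correction that can compensate for unreliable pose estimates."""

    def __init__(self, dim):
        # In a trained model these would be learned parameters.
        self.W = rng.normal(scale=0.1, size=(dim, 2 * dim))
        self.b = np.zeros(dim)

    def __call__(self, pose_feat, rgbd_feat):
        gate = sigmoid(self.W @ np.concatenate([pose_feat, rgbd_feat]) + self.b)
        # Residual form: output defaults to the pose pathway, with a
        # gated additive correction from the RGBD pathway.
        return pose_feat + gate * rgbd_feat

dim = 8
fusion = PoseGuidedResidualGatedFusion(dim)
fused = fusion(rng.normal(size=dim), rng.normal(size=dim))
```

The residual form makes the design intent explicit: when pose estimates are reliable, the gate can stay near zero and the policy behaves like the pure pose-guided variant; when they degrade, the RGBD residual can take over.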