Tube Diffusion Policy: Reactive Visual-Tactile Policy Learning for Contact-rich Manipulation

arXiv cs.RO / 4/28/2026


Key Points

  • The paper introduces Tube Diffusion Policy (TDP), a reactive visual-tactile imitation learning framework designed for contact-rich manipulation under uncertainty and disturbances.
  • TDP combines diffusion-based imitation with tube-based feedback control, learning an observation-conditioned feedback flow around nominal action chunks to form an “action tube” for rapid corrections during execution.
  • Experiments on the Push-T benchmark plus three additional visual-tactile dexterous manipulation tasks show that TDP consistently outperforms existing imitation-learning baselines.
  • Real-world tests confirm TDP’s robustness in handling contact uncertainty and external disturbances, and its step-wise correction reduces the number of denoising steps for real-time high-frequency control.
  • The proposed tube-based mechanism addresses a key limitation of action-chunking approaches by enabling reaction to unforeseen observations during execution.
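
The tube mechanism described above can be sketched schematically: a diffusion policy emits a nominal action chunk, and at each execution step an observation-conditioned correction is applied, clipped to stay inside a tube around the nominal action. This is an illustrative reconstruction, not the paper's implementation; the names `feedback_flow`, `tube_radius`, and the norm-clipping rule are assumptions made for the sketch.

```python
import numpy as np

def execute_action_tube(nominal_chunk, feedback_flow, get_observation,
                        apply_action, tube_radius=0.1):
    """Execute a nominal action chunk with step-wise tube corrections.

    nominal_chunk : (H, d) array of planned actions (e.g. from a diffusion policy)
    feedback_flow : callable(obs, nominal_action) -> correction vector
    tube_radius   : maximum per-step deviation from the nominal action
                    (illustrative bound; the paper's tube may be defined differently)
    """
    executed = []
    for a_nominal in nominal_chunk:
        obs = get_observation()                  # fresh observation each step
        delta = feedback_flow(obs, a_nominal)    # observation-conditioned correction
        # Clip the correction so the executed action stays inside the tube
        norm = np.linalg.norm(delta)
        if norm > tube_radius:
            delta = delta * (tube_radius / norm)
        a = a_nominal + delta
        apply_action(a)
        executed.append(a)
    return np.array(executed)
```

Because the correction is a cheap per-step function of the latest observation, the policy can react to disturbances between chunk re-plans instead of waiting for the next full denoising pass.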

Abstract

Contact-rich manipulation is central to many everyday human activities, requiring continuous adaptation to contact uncertainty and external disturbances through multi-modal perception, particularly vision and tactile feedback. While imitation learning has shown strong potential for learning complex manipulation behaviors, most existing approaches rely on action chunking, which fundamentally limits their ability to react to unforeseen observations during execution. This limitation becomes especially critical in contact-rich scenarios, where physical uncertainty and high-frequency tactile feedback demand rapid, reactive control. To address this challenge, we propose Tube Diffusion Policy (TDP), a novel reactive visual-tactile policy learning framework that bridges diffusion-based imitation learning with tube-based feedback control. By leveraging the expressive power of generative models, TDP learns an observation-conditioned feedback flow around nominal action chunks, forming an action tube that enables fast and adaptive reactions during execution. We evaluate TDP on the widely used Push-T benchmark and three additional challenging visual-tactile dexterous manipulation tasks. Across all benchmarks, TDP consistently outperforms state-of-the-art imitation learning baselines. Two real-world experiments further validate its robust reactivity under contact uncertainty and external disturbances. Moreover, the step-wise correction mechanism enabled by the action tube significantly reduces the required denoising steps, making TDP well suited for real-time, high-frequency feedback control in contact-rich manipulation.
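
The abstract's point about reduced denoising steps rests on the general structure of diffusion-policy sampling: an action chunk is produced by iteratively denoising a noise sample, and fewer iterations yield a coarser chunk faster. The toy sampler below is a generic Euler-style denoising loop, not the paper's sampler; `score_fn`, the step size, and `num_steps=5` are all assumptions made for illustration. The idea is that a coarse nominal chunk is acceptable when step-wise tube corrections refine it online.

```python
import numpy as np

def sample_action_chunk(score_fn, horizon, action_dim, num_steps=5, seed=0):
    """Illustrative iterative-denoising sampler (schematic only).

    score_fn(x, t) : estimates the noise component of x at step t
    num_steps      : fewer steps -> faster, coarser nominal chunk,
                     which online tube corrections can then refine
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in range(num_steps, 0, -1):
        # Simple Euler-style update toward the denoised sample
        x = x - (1.0 / num_steps) * score_fn(x, t)
    return x
```

With a small `num_steps`, one sampling pass is cheap enough to run inside a high-frequency control loop, which is the regime the abstract targets.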