Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation

arXiv cs.RO / 4/3/2026


Key Points

  • The paper studies how to fuse vision and force/torque (F/T) signals in diffusion-based robotic manipulation policies, focusing on contact-rich tasks where vision alone is insufficient.
  • It compares multiple existing integration strategies (e.g., auxiliary prediction objectives, mixture-of-experts, and contact-aware gating) to evaluate their relative effectiveness.
  • The authors introduce an adaptive fusion method that suppresses F/T inputs during non-contact phases and adaptively leverages both vision and torque information during contact (a minimal code sketch of this gating idea follows the list).
  • Experiments show the proposed adaptive approach improves success rate by 14% over the strongest baseline, underscoring the value of contact-aware multimodal fusion.
  • Overall, the work provides both a benchmark-style comparison of F/T-vision fusion designs and a practical architectural idea for improving contact-aware manipulation.
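
To make the gating idea concrete, below is a minimal PyTorch sketch of contact-aware fusion: a learned scalar gate, predicted from the raw wrench, scales the F/T embedding so that it is suppressed in free space and blended back in during contact. All module names, dimensions, and the gate parameterization are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFTFusion(nn.Module):
    """Contact-aware fusion of vision and force/torque (F/T) embeddings.

    A gate derived from the raw wrench suppresses the F/T embedding when
    the signal indicates free-space motion and blends it back in during
    contact. Names and sizes are illustrative, not the paper's design.
    """

    def __init__(self, vis_dim=512, ft_dim=6, hidden=128, out_dim=512):
        super().__init__()
        self.ft_encoder = nn.Sequential(
            nn.Linear(ft_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        # Scalar gate in [0, 1] predicted from the wrench; during
        # non-contact phases it should learn to stay near 0.
        self.gate = nn.Sequential(
            nn.Linear(ft_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid()
        )
        self.fuse = nn.Linear(vis_dim + out_dim, out_dim)

    def forward(self, vis_emb, ft):
        # vis_emb: (B, vis_dim) vision features; ft: (B, 6) wrench [Fx..Tz]
        g = self.gate(ft)                 # (B, 1) contact confidence
        ft_emb = g * self.ft_encoder(ft)  # suppressed when g is near 0
        return self.fuse(torch.cat([vis_emb, ft_emb], dim=-1))
```

A hard threshold on force magnitude could play the same role as the learned gate, but a smooth, learned gate keeps the fusion differentiable end to end.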

Abstract

Vision-based policies have achieved strong performance in robotic manipulation thanks to the accessibility and richness of visual observations. However, purely visual sensing becomes insufficient in contact-rich and force-sensitive tasks, where force/torque (F/T) signals provide critical information about contact dynamics, alignment, and interaction quality. Although various strategies have been proposed to integrate vision and F/T signals, including auxiliary prediction objectives, mixture-of-experts architectures, and contact-aware gating mechanisms, a systematic comparison of these approaches has been lacking. In this work, we present a comparative study of F/T-vision integration strategies within diffusion-based manipulation policies. In addition, we propose an adaptive integration strategy that ignores F/T signals during non-contact phases while adaptively leveraging both vision and torque information during contact. Experimental results demonstrate that our method outperforms the strongest baseline by 14% in success rate, highlighting the importance of contact-aware multimodal fusion for robotic manipulation.
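
For context on where such a fused observation embedding plugs in, the sketch below shows a generic DDPM-style action-sampling loop for a diffusion policy conditioned on that embedding. The noise schedule, denoiser signature, horizon, and action dimension are assumptions chosen for illustration; the paper's exact diffusion formulation is not reproduced here.

```python
import torch

@torch.no_grad()
def sample_actions(denoiser, cond, horizon=16, act_dim=7, steps=50):
    """Illustrative DDPM-style sampling for a diffusion policy conditioned
    on a fused vision/F-T embedding `cond` of shape (B, cond_dim).
    `denoiser(actions, t, cond)` is assumed to predict the added noise."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    B = cond.shape[0]
    a = torch.randn(B, horizon, act_dim)  # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((B,), t, dtype=torch.long)
        eps = denoiser(a, t_batch, cond)  # predicted noise, same shape as a
        # DDPM reverse-step posterior mean.
        a = (a - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:  # add noise on all but the final step
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a  # (B, horizon, act_dim) action sequence
```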