Cross-Modal Visuo-Tactile Object Perception

arXiv cs.RO / 4/3/2026

Key Points

  • The paper introduces Cross-Modal Latent Filter (CMLF), a method for estimating physical object properties for contact-rich robotic manipulation using both vision and tactile sensing.
  • CMLF learns a structured, causal latent state-space of object properties and performs Bayesian inference to update beliefs over time, rather than relying on purely static alignment or forceful fusion (a minimal sketch of this style of recursive update follows the list).
  • It enables bidirectional transfer of priors between vision and touch, helping address uncertainty and difficult-to-model effects like non-rigid deformation and nonlinear contact friction.
  • Real-world robotic experiments show improved efficiency and robustness in latent physical property estimation under uncertainty compared with baseline approaches.
  • The model also exhibits human-like perceptual coupling phenomena, such as susceptibility to cross-modal illusions and comparable learning trajectories across sensory modalities.
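The recursive update in the second point can be pictured with a simple Gaussian analogue. The sketch below is not CMLF itself (the paper learns its latent space and observation models); it is a hand-written 1-D Bayesian filter, with made-up noise levels and observation values, showing how interleaved visual and tactile evidence progressively tightens a belief over one latent property such as stiffness.

```python
# Minimal 1-D Gaussian belief update over a single latent property
# (e.g., stiffness). All numbers here are illustrative placeholders.

def gaussian_fuse(mean, var, obs, obs_var):
    """Conjugate Gaussian update: fold one noisy observation into the belief."""
    k = var / (var + obs_var)          # gain: trust the obs more when the belief is vague
    return mean + k * (obs - mean), (1.0 - k) * var

# Prior belief before any contact.
mean, var = 0.0, 1.0

# Interleaved visual and tactile evidence; touch is modeled as less noisy here.
stream = [("vision", 0.8, 0.5), ("touch", 1.1, 0.1), ("vision", 0.9, 0.5)]

for modality, obs, obs_var in stream:
    mean, var = gaussian_fuse(mean, var, obs, obs_var)
    print(f"after {modality}: belief N({mean:.3f}, {var:.3f})")
```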

Abstract

Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical property estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust, and physically consistent cross-modal integration for robotic multi-sensory perception.
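To make the "bidirectional transfer of cross-modal priors" concrete, the sketch below pushes a Gaussian visual posterior through a linear map to seed a tactile prior before contact; the reverse direction works symmetrically. The transfer matrix, noise level, and latent coordinates are hypothetical stand-ins for what CMLF learns, so treat this as an assumed illustration of the idea rather than the paper's implementation.

```python
import numpy as np

def transfer_prior(mean, cov, A, noise_cov):
    """Push a Gaussian belief through a linear map, inflating its uncertainty."""
    return A @ mean, A @ cov @ A.T + noise_cov

# Visual posterior over two object latents (say, shape and pose scale).
mu_v = np.array([0.6, -0.2])
cov_v = np.diag([0.05, 0.10])

# Vision -> touch: the visual belief becomes the tactile prior before contact.
A_v2t = np.array([[0.9, 0.1],
                  [0.0, 1.0]])             # hypothetical learned transfer map
mu_t, cov_t = transfer_prior(mu_v, cov_v, A_v2t, 0.02 * np.eye(2))

# Touch -> vision would use its own map, so tactile exploration in the dark
# can later sharpen the visual belief about the same object.
print("tactile prior mean:", mu_t)
print("tactile prior cov:\n", cov_t)
```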