ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

arXiv cs.CV / 4/30/2026


Key Points

  • The paper introduces ViTaPEs, a transformer-based method for learning task-agnostic visuotactile representations from paired vision and tactile inputs.
  • ViTaPEs improves cross-modal spatial reasoning through a two-stage positional injection: modality-specific positional encodings are added within each stream, and a shared global positional encoding is added to the joint token sequence immediately before attention (see the sketch after this list).
  • The authors make the positional injection points explicit and test them with controlled ablations (e.g., adding positions before a token-wise nonlinearity versus immediately before self-attention).
  • Experiments across multiple large real-world datasets show ViTaPEs outperforming prior state-of-the-art baselines on recognition tasks and achieving zero-shot generalization to unseen out-of-domain environments.
  • The approach also transfers well to robotics: on a grasping task, ViTaPEs predicts grasp success more accurately than state-of-the-art baselines.
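To make the two-stage injection concrete, here is a minimal PyTorch sketch of the idea. The learnable-embedding choice, token counts (196 vision patches, 64 tactile patches), embedding width, and single attention layer are illustrative assumptions; the paper's actual encodings and architecture may differ.

```python
import torch
import torch.nn as nn

class TwoStagePositionalFusion(nn.Module):
    """Sketch of two-stage positional injection for visuotactile fusion."""

    def __init__(self, dim=256, n_vis=196, n_tac=64, n_heads=8):
        super().__init__()
        # Stage 1: modality-specific (local) positional encodings,
        # added within each stream before fusion.
        self.vis_pos = nn.Parameter(torch.zeros(1, n_vis, dim))
        self.tac_pos = nn.Parameter(torch.zeros(1, n_tac, dim))
        # Stage 2: a shared (global) positional encoding over the
        # joint token sequence, added immediately before attention.
        self.joint_pos = nn.Parameter(torch.zeros(1, n_vis + n_tac, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, vis_tokens, tac_tokens):
        # vis_tokens: (B, n_vis, dim); tac_tokens: (B, n_tac, dim)
        vis = vis_tokens + self.vis_pos       # local injection, vision stream
        tac = tac_tokens + self.tac_pos       # local injection, tactile stream
        joint = torch.cat([vis, tac], dim=1)  # joint visuotactile sequence
        joint = joint + self.joint_pos        # global injection before attention
        out, _ = self.attn(joint, joint, joint)
        return out

fusion = TwoStagePositionalFusion()
out = fusion(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 260, 256])
```

The shared global encoding gives vision and tactile tokens a common positional vocabulary exactly where cross-modal attention occurs, which is the property the key points above attribute to ViTaPEs.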

Abstract

Tactile sensing provides essential local information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and in generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added to the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
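The ablation the abstract describes compares where positional information enters the network. Below is a hedged sketch of the two variants, assuming a simple GELU MLP as the token-wise nonlinearity and learnable joint positions; the exact blocks in the paper may differ.

```python
import torch
import torch.nn as nn

dim, n_tokens = 256, 260
pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

tokens = torch.randn(2, n_tokens, dim)

# Variant A: positions injected before the token-wise nonlinearity.
out_a = mlp(tokens + pos)

# Variant B: positions injected immediately before self-attention.
x = tokens + pos
out_b, _ = attn(x, x, x)
```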