TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

arXiv cs.RO · March 26, 2026


Key Points

  • TacVLA is a fine-tuned vision-language-action (VLA) model for robotic manipulation that improves performance in contact-rich, occlusion-prone, and fine-grained tasks by adding tactile inputs to a transformer policy.
  • It introduces a contact-aware gating mechanism that activates tactile tokens only when contact is detected, reducing irrelevant tactile interference and enabling adaptive multimodal fusion.
  • The approach jointly processes visual, language, and tactile tokens in the transformer to strengthen cross-modal grounding during physical interactions.
  • Experiments on constraint-locked disassembly, in-box picking, and robustness tests show sizable gains over baselines, including ~20% average improvement in disassembly, ~60% in in-box picking, and a 2.1× boost under visual occlusion.
  • The authors provide videos and plan to release code, supporting reproducibility and further evaluation of tactile-enhanced VLA policies.
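The contact-aware gating idea described above can be sketched in a few lines: tactile tokens are passed into the fused token sequence only when a contact signal crosses a threshold, and are suppressed otherwise so they do not interfere with vision-language processing in free space. The function names, shapes, and binary threshold below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def contact_gate(tactile_tokens, contact_force, threshold=0.5):
    # Hypothetical gate: pass tactile tokens only when the measured
    # contact signal exceeds a threshold; otherwise zero them out so
    # they contribute nothing to the fused sequence.
    gate = float(contact_force > threshold)
    return gate * tactile_tokens

def fuse_tokens(vision, language, tactile, contact_force, threshold=0.5):
    # Concatenate vision, language, and gated tactile tokens into one
    # sequence (shapes: [n_i, d]) for a transformer-style policy.
    gated = contact_gate(tactile, contact_force, threshold)
    return np.concatenate([vision, language, gated], axis=0)

# Toy example: 4 vision, 3 language, 2 tactile tokens of dimension 8.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
l = rng.normal(size=(3, 8))
t = rng.normal(size=(2, 8))

free_space = fuse_tokens(v, l, t, contact_force=0.1)  # no contact: tactile suppressed
in_contact = fuse_tokens(v, l, t, contact_force=0.9)  # contact: tactile passes through
```

In this sketch the gate is a hard binary switch; a learned soft gate (e.g. a sigmoid over a contact estimate) would be a natural variant, but the summary only specifies that tactile tokens are activated when contact is detected.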

Abstract

Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving success rates by an average of 20% in disassembly and 60% in in-box picking, with a 2.1× improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.