TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

arXiv cs.RO · March 26, 2026


Key Points

  • TacVLA is a fine-tuned vision-language-action (VLA) model for robotic manipulation that improves performance in contact-rich, occlusion-prone, and fine-grained tasks by adding tactile inputs to a transformer policy.
  • It introduces a contact-aware gating mechanism that activates tactile tokens only when contact is detected, reducing irrelevant tactile interference and enabling adaptive multimodal fusion.
  • The approach jointly processes visual, language, and tactile tokens in the transformer to strengthen cross-modal grounding during physical interactions.
  • Experiments on constraint-locked disassembly, in-box picking, and robustness tests show sizable gains over baselines, including ~20% average improvement in disassembly, ~60% in in-box picking, and a 2.1× boost under visual occlusion.
  • The authors provide videos and plan to release code, supporting reproducibility and further evaluation of tactile-enhanced VLA policies.
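The contact-aware gating idea described above can be sketched in a few lines: tactile tokens are passed into the fused token sequence only when a contact signal crosses a threshold, and are suppressed otherwise so they do not interfere with vision-language processing in free space. The function names, shapes, and binary threshold below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def contact_gate(tactile_tokens, contact_force, threshold=0.5):
    # Hypothetical gate: pass tactile tokens only when the measured
    # contact signal exceeds a threshold; otherwise zero them out so
    # they contribute nothing to the fused sequence.
    gate = float(contact_force > threshold)
    return gate * tactile_tokens

def fuse_tokens(vision, language, tactile, contact_force, threshold=0.5):
    # Concatenate vision, language, and gated tactile tokens into one
    # sequence (shapes: [n_i, d]) for a transformer-style policy.
    gated = contact_gate(tactile, contact_force, threshold)
    return np.concatenate([vision, language, gated], axis=0)

# Toy example: 4 vision, 3 language, 2 tactile tokens of dimension 8.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
l = rng.normal(size=(3, 8))
t = rng.normal(size=(2, 8))

free_space = fuse_tokens(v, l, t, contact_force=0.1)  # no contact: tactile suppressed
in_contact = fuse_tokens(v, l, t, contact_force=0.9)  # contact: tactile passes through
```

In this sketch the gate is a hard binary switch; a learned soft gate (e.g. a sigmoid over a contact estimate) would be a natural variant, but the summary only specifies that tactile tokens are activated when contact is detected.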

Abstract

Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving success rates by an average of 20% in disassembly and 60% in in-box picking, with a 2.1× improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.