VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

arXiv cs.CV / 4/7/2026


Key Points

  • VitaTouch is proposed as a vision-tactile-language model that fuses vision and touch to infer the material and surface properties beyond visible geometry required for quality inspection in manufacturing, and describes those attributes in natural language.
  • Modality-specific encoders and a dual Q-Former extract language-relevant visual and tactile features, which are compressed into prefix tokens for an LLM; contrastive learning explicitly strengthens the coupling between vision and touch.
  • The authors construct VitaSet, a multimodal dataset (186 objects, 52k images, and 5.1k human-verified instruction-answer pairs), and report strong performance on hardness and roughness estimation and on material-property description.
  • With LoRA-based fine-tuning, they report high accuracy in laboratory robot trials for defect recognition (2/3/5 categories), closed-loop recognition, and end-to-end sorting success rate.
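The prefix-token pipeline in the key points can be sketched as follows. This is a hypothetical, single-head NumPy illustration of the general Q-Former idea (a small set of learnable query tokens cross-attends to encoder features and yields a fixed-length prefix for the LLM); the dimensions, initialization, and single attention head are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64           # feature dimension (illustrative)
n_queries = 8    # prefix tokens per modality (illustrative)
n_patches = 196  # encoder output tokens, e.g. ViT patches

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_compress(queries, features, Wq, Wk, Wv):
    """Single-head cross-attention: learnable queries attend to encoder features."""
    Q = queries @ Wq                                  # (n_queries, d)
    K = features @ Wk                                 # (n_patches, d)
    V = features @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)     # (n_queries, n_patches)
    return attn @ V                                   # (n_queries, d) prefix tokens

# Separate query sets per modality -- the "dual" part of the dual Q-Former.
vis_feats = rng.standard_normal((n_patches, d))
tac_feats = rng.standard_normal((n_patches, d))
vis_queries = rng.standard_normal((n_queries, d)) * 0.02
tac_queries = rng.standard_normal((n_queries, d)) * 0.02
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

vis_prefix = qformer_compress(vis_queries, vis_feats, Wq, Wk, Wv)
tac_prefix = qformer_compress(tac_queries, tac_feats, Wq, Wk, Wv)
prefix_tokens = np.concatenate([vis_prefix, tac_prefix], axis=0)
print(prefix_tokens.shape)  # (16, 64): fixed-length prefix regardless of input size
```

The point of the compression step is that the LLM sees a constant number of tokens per modality, however many patches the encoders produce.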

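The LoRA-based fine-tuning noted above keeps the pretrained weights frozen and trains only a low-rank update. A minimal NumPy sketch of the standard LoRA parameterization (dimensions, rank, and scaling are illustrative assumptions, not VitaTouch's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 128, 128, 8, 16  # illustrative shapes and LoRA rank

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01      # trainable down-projection
B = np.zeros((d_out, r))                       # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen base path plus scaled low-rank path B @ A.
    # Because B starts at zero, the adapted model initially matches the base model.
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

x = rng.standard_normal((4, d_in))
assert np.allclose(lora_forward(x), x @ W.T)  # identity update at initialization
```

Only A and B (r * (d_in + d_out) parameters) are trained, which is why a large vision-tactile-language model can be adapted to a new defect taxonomy cheaply.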
Abstract

Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/
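The vision-touch contrastive coupling mentioned in the abstract can be sketched with a CLIP-style symmetric InfoNCE objective. This is a hedged illustration of that general technique; the function name, temperature, and toy embeddings are assumptions, not the paper's exact loss.

```python
import numpy as np

def clip_style_loss(vis_emb, tac_emb, temperature=0.07):
    """Symmetric InfoNCE: matched vision/touch pairs share a row index."""
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    t = tac_emb / np.linalg.norm(tac_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(v))                # positives on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # Average the vision->touch and touch->vision directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Matched pairs (identical embeddings) score lower loss than mismatched pairs.
v = np.eye(4, 8)
print(clip_style_loss(v, v) < clip_style_loss(v, np.roll(v, 1, axis=0)))  # True
```

Minimizing this loss pulls each object's visual and tactile embeddings together while pushing apart embeddings of different objects, which is what "explicitly couple vision and touch" amounts to.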