VLMaterial: Vision-Language Model-Based Camera-Radar Fusion for Physics-Grounded Material Identification

arXiv cs.RO / 4/14/2026


Key Points

  • VLMaterial fuses visual information (VLM + SAM) with mmWave radar to achieve physics-grounded material identification, even for visually similar objects such as glass and plastic.
  • On the radar side, an effective peak reflection cell area (PRCA) method and weighted vector synthesis estimate the dielectric constant, treating this electromagnetic property as a stable "physical parameter" reference.
  • A context-augmented generation (CAG) strategy equips the VLM with radar-specific physical knowledge, enabling semantic interpretations that remain consistent across sensors.
  • Adaptive fusion based on uncertainty estimation resolves cross-modal contradictions before making the final fused decision.
  • Across over 120 real-world experiments (41 object types plus 4 visually deceptive counterfeits), the authors report 96.08% recognition accuracy, on par with existing closed-set benchmarks while remaining training-free (no task-specific large-scale training).
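The "stable physical reference" idea in the second point can be illustrated with a minimal sketch: once the radar pipeline yields a dielectric-constant estimate, candidate materials can be ranked by how close their nominal permittivity is to that estimate. The reference values below are approximate textbook figures, and the function names are hypothetical, not the paper's actual calibration or API.

```python
# Approximate relative dielectric constants (textbook-level values,
# NOT the paper's calibrated references).
REFERENCE_EPSILON = {
    "glass": 6.0,
    "ceramic": 5.5,
    "plastic": 2.5,
    "wood": 2.0,
    "water": 80.0,
}

def rank_materials(eps_est: float, top_k: int = 3) -> list[str]:
    """Rank candidate materials by closeness of their nominal
    dielectric constant to a radar-derived estimate."""
    scored = sorted(
        REFERENCE_EPSILON.items(),
        key=lambda kv: abs(kv[1] - eps_est),
    )
    return [name for name, _ in scored[:top_k]]

# A glass-like estimate ranks glass above visually similar plastic.
print(rank_materials(5.8))  # → ['glass', 'ceramic', 'plastic']
```

Because the dielectric constant is intrinsic to the material, such a lookup is unaffected by lighting, transparency, or surface deception, which is exactly why the paper treats it as the anchor for resolving visual ambiguity.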

Abstract

Accurate material recognition is a fundamental capability for intelligent perception systems to interact safely and effectively with the physical world. For instance, distinguishing visually similar objects like glass and plastic cups is critical for safety but challenging for vision-based methods due to specular reflections, transparency, and visual deception. While millimeter-wave (mmWave) radar offers robust material sensing regardless of lighting, existing camera-radar fusion methods are limited to closed-set categories and lack semantic interpretability. In this paper, we introduce VLMaterial, a training-free framework that fuses vision-language models (VLMs) with domain-specific radar knowledge for physics-grounded material identification. First, we propose a dual-pipeline architecture: an optical pipeline uses the Segment Anything Model (SAM) and a VLM for material candidate proposals, while an electromagnetic characterization pipeline extracts the intrinsic dielectric constant from radar signals via an effective peak reflection cell area (PRCA) method and weighted vector synthesis. Second, we employ a context-augmented generation (CAG) strategy to equip the VLM with radar-specific physical knowledge, enabling it to interpret electromagnetic parameters as stable references. Third, an adaptive fusion mechanism is introduced to intelligently integrate outputs from both sensors by resolving cross-modal conflicts based on uncertainty estimation. We evaluate VLMaterial in over 120 real-world experiments involving 41 diverse everyday objects and 4 typical visually deceptive counterfeits across varying environments. Experimental results demonstrate that VLMaterial achieves a recognition accuracy of 96.08%, delivering performance on par with state-of-the-art closed-set benchmarks while eliminating the need for extensive task-specific data collection and training.
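The abstract's third component, adaptive fusion driven by uncertainty estimation, can be sketched in a few lines. The version below uses entropy as the uncertainty proxy and down-weights the less confident modality; this is one common way to realize such a mechanism, not necessarily the paper's exact fusion rule, and all function names are illustrative.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of a categorical distribution (uncertainty proxy)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def adaptive_fuse(p_vision, p_radar) -> np.ndarray:
    """Blend two per-material distributions, weighting each modality
    by its inverse entropy so the more certain sensor dominates."""
    p_vision, p_radar = np.asarray(p_vision, float), np.asarray(p_radar, float)
    c_v = 1.0 / (entropy(p_vision) + 1e-6)
    c_r = 1.0 / (entropy(p_radar) + 1e-6)
    w_v = c_v / (c_v + c_r)  # vision's share of the fused decision
    fused = w_v * p_vision + (1.0 - w_v) * p_radar
    return fused / fused.sum()

# Vision cannot tell glass from plastic (50/50); radar, grounded in the
# dielectric constant, is confident in glass, so the fused result is too.
print(adaptive_fuse([0.5, 0.5], [0.9, 0.1]))
```

In a conflict like the glass-versus-plastic case from the abstract, the vision branch is near-maximally uncertain while the radar branch is not, so the fused distribution follows the physics-grounded evidence, which matches the role the paper assigns to uncertainty-based conflict resolution.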