FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

arXiv cs.CV / 4/10/2026


Key Points

  • FORGE constructs a multimodal evaluation dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained manufacturing-domain semantics such as exact model numbers.
  • Using this dataset, 18 state-of-the-art MLLMs are evaluated on three manufacturing tasks (workpiece verification, structural surface inspection, and assembly verification), revealing significant performance gaps.
  • A bottleneck analysis shows that, counter to conventional understanding, the primary limiting factor is not weak visual grounding but insufficient domain-specific knowledge.
  • The structured annotations also serve as a training resource: supervised fine-tuning a 3B-parameter model on FORGE yields up to a 90.8% relative improvement in accuracy on held-out manufacturing scenarios.
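
Note that the 90.8% figure is a *relative* gain over the baseline, not an absolute accuracy increase. A minimal sketch of the distinction, using hypothetical accuracy values chosen for illustration only (not numbers reported by the paper):

```python
def relative_improvement(baseline: float, tuned: float) -> float:
    """Relative accuracy gain of a fine-tuned model over its baseline."""
    return (tuned - baseline) / baseline

# Hypothetical example: a model going from 30.0% to 57.2% accuracy
# gains 27.2 points absolute, but roughly 90.7% relative.
print(round(relative_improvement(0.30, 0.572), 3))  # → 0.907
```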

Abstract

The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.