PLaMo 2.1-VL Technical Report

arXiv cs.CV / April 22, 2026


Key Points

  • The paper introduces PLaMo 2.1-VL, a lightweight vision-language model for autonomous devices, available in 8B and 2B variants and designed for local/edge deployment with Japanese-language operation.
  • It targets two main capabilities—Visual Question Answering (VQA) and Visual Grounding—and evaluates performance on Japanese and English benchmarks.
  • The work provides a large-scale synthetic data generation pipeline plus Japanese training and evaluation resources to support model development and assessment.
  • On reported benchmarks, PLaMo 2.1-VL reaches 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4, outperforming comparable open models (a sketch of the ROUGE-L computation follows this list).
  • For real-world scenarios, it achieves 53.9% zero-shot accuracy for factory task analysis via tool recognition, and fine-tuning on power-plant data raises anomaly detection bbox + label F1-score from 39.7 to 64.9.
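For context on the JA-VG-VQA-500 figure: ROUGE-L measures the longest-common-subsequence (LCS) overlap between a model answer and a reference answer, reported as an F-measure (usually scaled by 100). Below is a minimal, self-contained sketch; the character-level tokenization (a common choice for Japanese text) and the example strings are illustrative assumptions, not details taken from the report.

```python
def rouge_l_f(reference: str, prediction: str) -> float:
    """ROUGE-L F-measure via longest common subsequence (LCS).

    Tokens here are characters, a common choice for Japanese;
    the report's actual tokenizer is not specified in this summary.
    """
    ref, hyp = list(reference), list(prediction)
    m, n = len(ref), len(hyp)
    # Standard dynamic-programming table for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

# Hypothetical example (not from the paper); scores are reported x100.
print(round(rouge_l_f("赤い車が道路を走る", "赤い車が道を走る") * 100, 1))  # 94.1
```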

Abstract

We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.
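On the anomaly-detection metric: a "bbox + label F1-score" is typically computed by matching each predicted box to a ground-truth box of the same label at some IoU threshold and taking the F1 over the resulting true positives. The sketch below is one plausible reading, assuming a 0.5 IoU threshold and greedy one-to-one matching; the report's exact protocol is not given in this summary. The same IoU test commonly underlies grounding accuracy on referring-expression benchmarks such as Ref-L4.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def bbox_label_f1(predictions, ground_truth, thresh=0.5):
    """F1 over detections given as (label, box) pairs.

    A prediction is a true positive when it greedily matches an
    unmatched ground-truth box with the same label and IoU >= thresh.
    The threshold and matching rule are assumptions, not the paper's.
    """
    matched = [False] * len(ground_truth)
    tp = 0
    for p_label, p_box in predictions:
        for i, (g_label, g_box) in enumerate(ground_truth):
            if not matched[i] and p_label == g_label and iou(p_box, g_box) >= thresh:
                matched[i] = True
                tp += 1
                break
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Hypothetical example: one of two predictions matches one of two GT boxes.
preds = [("rust", (10, 10, 50, 50)), ("leak", (60, 60, 90, 90))]
gts = [("rust", (12, 12, 52, 52)), ("crack", (100, 100, 140, 140))]
print(bbox_label_f1(preds, gts))  # 0.5
```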