PLaMo 2.1-VL Technical Report
arXiv cs.CV / 4/22/2026
Key Points
- The paper introduces PLaMo 2.1-VL, a lightweight vision-language model designed for autonomous devices operated in Japanese, released in 8B and 2B variants suitable for local/edge deployment.
- It targets two main capabilities—Visual Question Answering (VQA) and Visual Grounding—and evaluates performance on Japanese and English benchmarks.
- The work provides a large-scale synthetic data generation pipeline plus Japanese training and evaluation resources to support model development and assessment.
- On reported benchmarks, PLaMo 2.1-VL reaches 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4, outperforming comparable open models.
- In real-world scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis via tool recognition, and fine-tuning on power-plant data raises the bounding-box-plus-label F1 score for anomaly detection from 39.7 to 64.9.
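The JA-VG-VQA-500 result above is reported in ROUGE-L, an overlap metric based on the longest common subsequence (LCS) between a model answer and a reference answer. As a rough illustration of how such a score is computed (a minimal sketch, not the paper's evaluation code; real Japanese evaluation would tokenize with a morphological analyzer or at the character level rather than splitting on whitespace):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via classic O(n*m) DP."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 (beta = 1) over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # LCS coverage of the candidate
    recall = lcs / len(ref)       # LCS coverage of the reference
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l("the cat sat", "the cat sat")` returns 1.0, while a partially overlapping answer scores between 0 and 1 in proportion to its LCS with the reference.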