From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

arXiv cs.CV / 4/14/2026


Key Points

  • This study proposes PlantXpert, an evidence-grounded multimodal LLM benchmark for soybean and cotton plant phenotyping, providing a framework for evaluating and comparing agronomic reasoning.
  • The benchmark comprises 385 digital images and more than 3,000 samples spanning multiple domains, including disease, pest control, weed management, and yield, and measures visual expertise, quantitative reasoning, and multi-step agronomic reasoning.
  • An evaluation of 11 state-of-the-art VLMs showed that domain-specific fine-tuning substantially improves accuracy, with Qwen3-VL-4B/30B reaching up to 78%.
  • However, gains from scaling model size plateau beyond a certain capacity, generalization between soybean and cotton is uneven, and quantitative, biologically grounded reasoning remains difficult.
  • PlantXpert is expected to serve as an evaluation foundation for evidence-grounded multimodal reasoning in agriculture and to advance model development for plant science.

Abstract

To improve crop genetics, high-throughput, effective, and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains, including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities, including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continues to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.