CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis

arXiv cs.CV / 5/6/2026


Key Points

  • CropVLM is a domain-adapted vision-language model designed to address the agricultural “phenotyping bottleneck,” where manual plant trait measurement is slow and biased.
  • The model is trained on 52,987 manually curated image-caption pairs across 37 crop species in natural field conditions, using Domain-Specific Semantic Alignment (DSSA) to connect agronomic terms to fine-grained visual features.
  • CropVLM enables open-set crop analysis via the proposed Hybrid Open-Set Localization Network (HOS-Net), allowing detection of novel crops from natural language descriptions without retraining.
  • In evaluations, CropVLM reaches 72.51% zero-shot classification accuracy and outperforms seven CLIP-style baselines.
  • The model weights and full pipeline are publicly released. On detection benchmarks, the pipeline reaches 49.17 AP50 on CVTCropDet and 50.73 AP50 on tropical fruit species, versus 34.89 and 48.58 for the next-best method, indicating strong zero-shot generalization.
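The zero-shot classification figure above follows the standard CLIP-style recipe: embed the image and a set of text prompts into a shared space, then pick the prompt most similar to the image. The sketch below illustrates the idea with toy embeddings and made-up prompt strings; CropVLM's actual encoders and prompt templates are not shown here.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_emb, prompt_embs):
    """CLIP-style zero-shot classification: return the prompt whose
    text embedding is most cosine-similar to the image embedding.
    No class-specific training is involved -- only embedding lookup."""
    return max(prompt_embs, key=lambda p: cosine(image_emb, prompt_embs[p]))

# Toy 2-D embeddings standing in for a vision-language model's outputs.
prompt_embs = {
    "a photo of a wheat spike": [1.0, 0.1],
    "a photo of a maize ear":   [0.1, 1.0],
}
image_emb = [0.9, 0.2]  # closer to the "wheat spike" prompt

pred = zero_shot_classify(image_emb, prompt_embs)
```

In a real pipeline the embeddings would come from the model's image and text encoders; swapping in a new crop species only requires writing a new prompt, which is what makes the approach open-set.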

Abstract

High-throughput plant phenotyping, the quantitative measurement of observable plant traits, is critical for modern breeding but remains constrained by a "phenotyping bottleneck," where manual data collection is labor-intensive and prone to observer bias. Conventional closed-set computer vision systems fail to address this challenge, as they require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations. To bridge this gap, we present CropVLM, a Vision-Language Model (VLM) adapted for the agricultural domain via Domain-Specific Semantic Alignment (DSSA). Trained on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions, CropVLM effectively maps agronomic terminology to fine-grained visual features. We further introduce the Hybrid Open-Set Localization Network (HOS-Net), an architecture that integrates CropVLM to enable the detection of novel crops solely from natural language descriptions without retraining. By eliminating the reliance on species-specific training data, CropVLM provides a scalable solution for high-throughput phenotyping, accelerating genetic gain and facilitating large-scale biodiversity research essential for sustainable agriculture. The trained model weights and complete pipeline implementation are publicly available at: [https://github.com/boudiafA/CropVLM](https://github.com/boudiafA/CropVLM). In comprehensive evaluations, CropVLM achieves 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines. Our detection pipeline demonstrates superior zero-shot generalization to novel species, achieving 49.17 AP50 on our CVTCropDet benchmark and 50.73 AP50 on tropical fruit species, compared to 34.89 and 48.58 for the next-best method, respectively.
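The open-set detection described in the abstract can be pictured as scoring candidate regions against a free-text query in the shared embedding space. The sketch below is a minimal illustration of that idea, not HOS-Net itself: the box names, embeddings, and threshold are all hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def open_set_detect(region_embs, query_emb, threshold=0.5):
    """Text-prompted open-set detection, in the spirit of HOS-Net's use
    of a domain-adapted VLM: keep region proposals whose embedding
    matches the natural-language query above a similarity threshold,
    with no species-specific retraining."""
    return [box for box, emb in region_embs.items()
            if cosine(emb, query_emb) >= threshold]

# Toy region-proposal embeddings and a toy query embedding, e.g. for
# the (hypothetical) description "ripe red fruit on a green vine".
region_embs = {"box_1": [1.0, 0.0], "box_2": [0.0, 1.0]}
query_emb = [1.0, 0.05]

detections = open_set_detect(region_embs, query_emb)
```

Because the query is plain text, detecting a previously unseen species reduces to describing it, which is the property the paper's zero-shot AP50 results measure.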