Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection

arXiv cs.CV / 4/14/2026


Key Points

  • The paper studies how prompt construction critically affects zero-shot vision foundation model (VFM) performance for agricultural object detection, focusing on cowpea flower and pod detection in complex field imagery.
  • It introduces a systematic prompt optimization framework that decomposes prompts into eight axes and shows that prompt structures beneficial for one detector architecture can significantly degrade others.
  • Experiments across four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, OWLv2) demonstrate substantial improvements from model-specific combinatorial prompts over a naive species-name baseline (e.g., roughly +0.36 mAP@0.5 on synthetic flower data).
  • Using an LLM-driven prompt translation strategy, the authors evaluate cross-task generalization from flowers to morphologically distinct pods and find that synthetic-optimized prompt structures transfer well to real-world fields.
  • Overall, the work argues that effective prompt engineering can substantially narrow the gap between zero-shot VFMs and supervised detectors without manual annotation, while emphasizing that optimal prompts are non-obvious and architecture-specific.
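The one-factor-at-a-time analysis mentioned above can be sketched as follows. The paper decomposes prompts into eight axes but does not enumerate them in this summary, so the axis names and values below are purely illustrative, and `evaluate` is a hypothetical callback standing in for running a detector and scoring mAP@0.5.

```python
# Hypothetical prompt axes -- illustrative only; the paper's actual
# eight axes are not listed in this summary.
AXES = {
    "descriptor": ["", "small ", "purple "],
    "noun":       ["flower", "cowpea flower", "blossom"],
    "context":    ["", " in a field", " on a plant"],
}

BASELINE = {"descriptor": "", "noun": "cowpea flower", "context": ""}

def build_prompt(choice):
    """Assemble a text prompt from one value per axis."""
    return f'{choice["descriptor"]}{choice["noun"]}{choice["context"]}'

def ofat(evaluate, axes, baseline):
    """One-factor-at-a-time: vary a single axis while holding the rest
    at the baseline, recording the score of each variant prompt."""
    results = {}
    for axis, values in axes.items():
        for value in values:
            choice = dict(baseline, **{axis: value})
            results[(axis, value)] = evaluate(build_prompt(choice))
    return results
```

The OFAT pass identifies which axis values help each model in isolation; those per-axis findings then seed the combinatorial optimization stage.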

Abstract

Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors -- YOLO World, SAM3, Grounding DINO, and OWLv2 -- for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target -- cowpea pods -- and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.
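The combinatorial optimization stage described in the abstract can be sketched as an exhaustive search over axis values, assuming each axis offers only a handful of options so the product space stays small. As above, the axes are illustrative and `evaluate` is a hypothetical callback that would run the chosen detector on a validation set and return mAP@0.5.

```python
from itertools import product

def combinatorial_search(evaluate, axes):
    """Score every combination of axis values and keep the best prompt.
    Feasible only because each axis has a small set of candidate values."""
    best_prompt, best_score = None, float("-inf")
    for combo in product(*axes.values()):
        # Drop empty axis values, then join the rest into one prompt.
        prompt = " ".join(v for v in combo if v)
        score = evaluate(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

Because the paper finds optimal prompts to be model-specific, this search would be rerun per detector rather than shared, and the discovered axis structure (not the literal strings) is what the LLM translation step carries over to a new target such as pods.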