Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection
arXiv cs.CV, April 14, 2026
Key Points
- The paper studies how prompt construction critically affects zero-shot vision foundation model (VFM) performance for agricultural object detection, focusing on cowpea flower and pod detection in complex field imagery.
- It introduces a systematic prompt optimization framework that decomposes prompts along eight axes and searches their combinations per detector (see the sketch after this list), showing that prompt structures that help one detector architecture can significantly degrade another.
- Experiments across four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, OWLv2) demonstrate substantial gains from model-specific combinatorially optimized prompts over a naive species-name baseline (e.g., roughly +0.35 mAP@0.5 on synthetic flower data).
- Using an LLM-driven prompt translation strategy (a simplified stand-in follows the list below), the authors evaluate cross-task generalization from flowers to morphologically distinct pods and find that synthetic-optimized prompt structures transfer well to real-world field imagery.
- Overall, the work argues that effective prompt engineering can substantially narrow the gap between zero-shot VFMs and supervised detectors without manual annotation, while emphasizing that optimal prompts are non-obvious and architecture-specific.
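
The summary describes a search over combinations of prompt axes, scored separately for each detector. Below is a minimal sketch of that idea; the specific axis names and values are illustrative stand-ins (the summary does not enumerate the paper's eight axes), and `score_fn` is a hypothetical hook assumed to wrap a detector's zero-shot inference plus a standard mAP@0.5 computation.

```python
from itertools import product

# Hypothetical prompt axes -- the paper decomposes prompts into eight axes,
# but this summary does not enumerate them, so these are invented examples.
AXES = {
    "target":  ["cowpea flower", "Vigna unguiculata flower"],
    "color":   ["", "purple", "white"],
    "shape":   ["", "small bilateral"],
    "context": ["", "in a field", "on a green plant"],
}

def build_prompts(axes):
    """Enumerate every combination of axis values into a flat text prompt."""
    keys = list(axes)
    for combo in product(*(axes[k] for k in keys)):
        parts = [p for p in combo if p]  # drop empty slots (axis omitted)
        yield " ".join(parts)

def best_prompt_for(detector, prompts, val_images, val_boxes, score_fn):
    """Pick the prompt maximizing mAP@0.5 for one specific detector.

    score_fn(detector, prompt, images, boxes) -> float is assumed to run
    the detector zero-shot with `prompt` and score it against val_boxes.
    """
    return max(prompts, key=lambda p: score_fn(detector, p, val_images, val_boxes))
```

Because the best prompt is selected per detector, the same search can yield very different winners for, say, Grounding DINO and OWLv2, which is consistent with the paper's finding that optimal prompts are architecture-specific.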
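
The cross-task transfer step keeps the optimized prompt's structure and uses an LLM to swap task-specific content (flower descriptors to pod descriptors). The sketch below hard-codes the kind of per-axis substitution an LLM would produce; the mapping is invented for illustration and is not taken from the paper.

```python
# Stand-in for the paper's LLM-driven translation: the mapping below mimics
# what an LLM might output when asked to re-describe pods instead of flowers.
FLOWER_TO_POD = {
    "cowpea flower": "cowpea pod",
    "purple": "slender green",
    "small bilateral": "long cylindrical",
}

def translate_prompt(structured_prompt: dict) -> dict:
    """Swap task-specific axis values while keeping the prompt structure
    (which axes are present, and in what order) unchanged."""
    return {axis: FLOWER_TO_POD.get(value, value)
            for axis, value in structured_prompt.items()}

flower_prompt = {"target": "cowpea flower", "color": "purple", "context": "in a field"}
pod_prompt = translate_prompt(flower_prompt)
# -> {"target": "cowpea pod", "color": "slender green", "context": "in a field"}
```

The design choice this illustrates: only the axis values change, so a prompt structure that was tuned on synthetic flower data carries over to the pod task without re-running the combinatorial search.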