Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection

arXiv cs.CV / 4/14/2026


Key Points

  • The paper studies how prompt construction critically affects zero-shot vision foundation model (VFM) performance for agricultural object detection, focusing on cowpea flower and pod detection in complex field imagery.
  • It introduces a systematic prompt optimization framework that decomposes prompts into eight axes and shows that prompt structures beneficial for one detector architecture can significantly degrade others.
  • Experiments across four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, OWLv2) demonstrate substantial improvements from model-specific combinatorial prompts over a naive species-name baseline (e.g., roughly +0.36 mAP@0.5 on synthetic flower data).
  • Using an LLM-driven prompt translation strategy, the authors evaluate cross-task generalization from flowers to morphologically distinct pods and find that synthetic-optimized prompt structures transfer well to real-world fields.
  • Overall, the work argues that effective prompt engineering can substantially narrow the gap between zero-shot VFMs and supervised detectors without manual annotation, while emphasizing that optimal prompts are non-obvious and architecture-specific.
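The one-factor-at-a-time analysis mentioned above can be sketched as follows. The paper decomposes prompts into eight axes but does not enumerate them in this summary, so the axis names and values below are purely illustrative, and `evaluate` is a hypothetical callback standing in for running a detector and scoring mAP@0.5.

```python
# Hypothetical prompt axes -- illustrative only; the paper's actual
# eight axes are not listed in this summary.
AXES = {
    "descriptor": ["", "small ", "purple "],
    "noun":       ["flower", "cowpea flower", "blossom"],
    "context":    ["", " in a field", " on a plant"],
}

BASELINE = {"descriptor": "", "noun": "cowpea flower", "context": ""}

def build_prompt(choice):
    """Assemble a text prompt from one value per axis."""
    return f'{choice["descriptor"]}{choice["noun"]}{choice["context"]}'

def ofat(evaluate, axes, baseline):
    """One-factor-at-a-time: vary a single axis while holding the rest
    at the baseline, recording the score of each variant prompt."""
    results = {}
    for axis, values in axes.items():
        for value in values:
            choice = dict(baseline, **{axis: value})
            results[(axis, value)] = evaluate(build_prompt(choice))
    return results
```

The OFAT pass identifies which axis values help each model in isolation; those per-axis findings then seed the combinatorial optimization stage.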

Abstract

Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors -- YOLO World, SAM3, Grounding DINO, and OWLv2 -- for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target -- cowpea pods -- and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.
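The combinatorial optimization stage described in the abstract can be sketched as an exhaustive search over axis values, assuming each axis offers only a handful of options so the product space stays small. As above, the axes are illustrative and `evaluate` is a hypothetical callback that would run the chosen detector on a validation set and return mAP@0.5.

```python
from itertools import product

def combinatorial_search(evaluate, axes):
    """Score every combination of axis values and keep the best prompt.
    Feasible only because each axis has a small set of candidate values."""
    best_prompt, best_score = None, float("-inf")
    for combo in product(*axes.values()):
        # Drop empty axis values, then join the rest into one prompt.
        prompt = " ".join(v for v in combo if v)
        score = evaluate(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

Because the paper finds optimal prompts to be model-specific, this search would be rerun per detector rather than shared, and the discovered axis structure (not the literal strings) is what the LLM translation step carries over to a new target such as pods.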