GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection
arXiv cs.CV / 3/31/2026
Key Points
- The paper introduces GUIDED, a decomposition framework for fine-grained open-vocabulary object detection that targets failures caused by semantic entanglement between object subjects and descriptive attributes in VLM embeddings.
- GUIDED separates the problem into distinct pathways for localization and fine-grained recognition: it extracts a coarse-grained subject plus attributes via a language model, then guides localization using only the subject embedding to prevent mislocalization and embedding drift.
- To avoid losing useful descriptive cues, the method adds an attention-based attribute embedding fusion module that selectively incorporates helpful attributes into detection queries while reducing attribute over-representation.
- It further improves recognition with a region-level attribute discrimination module that compares each detected region against the full fine-grained class names, using a refined vision-language model with a projection head for better embedding alignment.
- Experiments on FG-OVD and 3F-OVD benchmarks report new state-of-the-art performance, and the authors plan to release the code on GitHub.
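
The two recognition-side ideas in the key points above (attention-based attribute fusion and region-level discrimination) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `fuse_attributes` and `discriminate`, the fixed scalar `gate`, and the use of plain scaled dot-product attention and cosine similarity are all assumptions made for illustration. The paper's modules are learned, and its exact architecture is not described in this summary.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_attributes(subject, attributes, gate=0.5):
    """Hypothetical attribute fusion (assumed form, not the paper's module).

    subject:    (d,)   embedding of the coarse-grained subject, e.g. "dog"
    attributes: (k, d) embeddings of the attributes, e.g. "brown", "fluffy"
    gate:       scalar limiting how much attribute context is mixed in,
                a stand-in for reducing attribute over-representation

    Returns the fused query embedding (d,) and the attention weights (k,).
    """
    d = subject.shape[-1]
    # The subject acts as the attention query over the attribute embeddings.
    scores = attributes @ subject / np.sqrt(d)
    weights = softmax(scores)
    context = weights @ attributes
    # Residual mix: the subject stays dominant, attributes only adjust it.
    fused = l2_normalize(subject + gate * context)
    return fused, weights


def discriminate(region_embs, class_embs):
    """Hypothetical region-level discrimination via cosine similarity.

    region_embs: (n, d) embeddings of detected regions
    class_embs:  (c, d) embeddings of the full fine-grained class names

    Returns the best-matching class index per region (n,) and the
    similarity matrix (n, c).
    """
    sims = l2_normalize(region_embs) @ l2_normalize(class_embs).T
    return sims.argmax(axis=1), sims


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subject = rng.normal(size=16)
    attributes = rng.normal(size=(3, 16))
    fused, weights = fuse_attributes(subject, attributes)
    print("attention weights:", np.round(weights, 3))

    class_embs = rng.normal(size=(4, 16))
    # Regions built as noisy copies of classes 2 and 0.
    region_embs = class_embs[[2, 0]] + 0.01 * rng.normal(size=(2, 16))
    labels, _ = discriminate(region_embs, class_embs)
    print("predicted classes:", labels)
```

In this sketch, localization would use only `subject`, while `fused` would feed the detection queries and `discriminate` would rescore the detected regions against the full class names, mirroring the separation of pathways described above.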