GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection

arXiv cs.CV / 3/31/2026


Key Points

  • The paper introduces GUIDED, a decomposition framework for fine-grained open-vocabulary object detection that targets failures caused by semantic entanglement between object subjects and descriptive attributes in VLM embeddings.
  • GUIDED separates the problem into distinct pathways for localization and fine-grained recognition: it extracts a coarse-grained subject plus attributes via a language model, then guides localization using only the subject embedding to prevent mislocalization and embedding drift.
  • To avoid losing useful descriptive cues, the method adds an attention-based attribute embedding fusion module that selectively incorporates helpful attributes into detection queries while reducing attribute over-representation.
  • It further improves recognition by using a region-level attribute discrimination module that compares detected regions against full fine-grained class names with a refined vision-language model and projection head for better embedding alignment.
  • Experiments on FG-OVD and 3F-OVD benchmarks report new state-of-the-art performance, and the authors plan to release the code on GitHub.
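The decomposition step in the first two key points can be sketched in a few lines. The paper uses a language model to extract the coarse-grained subject and its attributes; the toy heuristic below (last noun as subject, preceding modifiers as attributes) is a hypothetical stand-in for illustration only, and the function name is our own invention.

```python
# Hypothetical sketch of GUIDED's decomposition step: split an
# attribute-rich class name into a coarse-grained subject (used alone
# to guide localization) and descriptive attributes (fused later).
# The paper delegates this to a language model; a simple word-order
# heuristic stands in here.

def decompose_class_name(fine_grained_name: str):
    """Return (subject, attributes) for a fine-grained class name."""
    words = fine_grained_name.lower().split()
    # Drop leading articles such as "a red striped mug" -> "red striped mug".
    while words and words[0] in {"a", "an", "the"}:
        words.pop(0)
    subject = words[-1]        # coarse-grained subject for stable localization
    attributes = words[:-1]    # descriptive attributes, fused selectively later
    return subject, attributes
```

For example, `decompose_class_name("a red striped mug")` yields the subject `"mug"` and the attributes `["red", "striped"]`; only the subject's embedding would then drive the detector's localization pathway.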

Abstract

Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings -- leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to it. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. The detector is then guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or over-represented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at https://github.com/lijm48/GUIDED.
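The two recognition modules described in the abstract can be sketched numerically. The block below is a minimal NumPy illustration, not the paper's implementation: the attention-based fusion is shown as a single residual cross-attention step from detection queries to attribute embeddings, and the region-level discrimination as cosine similarity between projected region embeddings and full class-name embeddings. All function names, the `gate` parameter, and the single-head formulation are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_attributes(queries, attr_embs, gate=0.5):
    """Attention-based attribute fusion (illustrative single-head sketch).

    queries:   (n_q, d) subject-guided detection query embeddings
    attr_embs: (n_a, d) attribute text embeddings
    Returns fused queries of shape (n_q, d). The residual connection keeps
    the subject signal dominant, mitigating attribute over-representation.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ attr_embs.T / np.sqrt(d))  # (n_q, n_a)
    return queries + gate * (attn @ attr_embs)

def score_regions(region_embs, class_embs):
    """Region-level discrimination as cosine similarity.

    region_embs: (n_r, d) projected region embeddings
    class_embs:  (n_c, d) embeddings of full fine-grained class names
    Returns a (n_r, n_c) similarity matrix in [-1, 1].
    """
    r = region_embs / np.linalg.norm(region_embs, axis=-1, keepdims=True)
    c = class_embs / np.linalg.norm(class_embs, axis=-1, keepdims=True)
    return r @ c.T
```

In the paper's pipeline, the fused queries would feed the detector's decoder, while the similarity matrix would rank fine-grained class names per detected region; the projection head that aligns region and text embeddings is abstracted away here.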