Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs
arXiv cs.CV / 3/27/2026
Key Points
- The paper introduces SGREC, an interpretable zero-shot referring expression comprehension method that identifies target objects in images using natural-language queries without task-specific training data.
- Instead of relying solely on feature-similarity matching (as many VLM-based approaches do), SGREC builds a query-driven scene graph that encodes spatial relationships, descriptive captions, and object interactions relevant to the query.
- The method then uses an LLM to infer the target object from the structured textual representation of the scene graph, producing detailed explanations that make its decisions interpretable.
- Experiments report strong zero-shot performance across the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%).
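The pipeline the key points describe — filter detected objects and relations by relevance to the query, assemble them into a scene graph, and serialize that graph into structured text for an LLM to reason over — can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the `SceneNode`/`SceneGraph` types, the word-overlap relevance filter, and all names are hypothetical stand-ins (the detector, captioner, and LLM call are omitted).

```python
# Hypothetical sketch of an SGREC-style query-driven scene graph.
# All class/function names are illustrative, not the paper's actual API.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    obj_id: int
    label: str
    caption: str   # descriptive caption for this object
    box: tuple     # bounding box (x, y, w, h)

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src_id, relation, dst_id)

    def to_text(self) -> str:
        """Serialize the graph into the structured text an LLM would reason over."""
        lines = [f"[{n.obj_id}] {n.label}: {n.caption}" for n in self.nodes]
        lines += [f"[{s}] {rel} [{d}]" for s, rel, d in self.edges]
        return "\n".join(lines)

def build_query_driven_graph(detections, relations, query):
    """Keep only objects (and relations between them) whose labels or
    captions share words with the referring expression -- a crude stand-in
    for the paper's query-driven graph construction."""
    words = set(query.lower().split())
    graph = SceneGraph()
    for det in detections:
        if words & set((det.label + " " + det.caption).lower().split()):
            graph.nodes.append(det)
    kept = {n.obj_id for n in graph.nodes}
    graph.edges = [(s, r, d) for s, r, d in relations if s in kept and d in kept]
    return graph

# Toy example: two cups and a table; the query picks out the red cup.
dets = [
    SceneNode(0, "cup", "a red cup", (10, 10, 30, 30)),
    SceneNode(1, "cup", "a blue cup", (80, 10, 30, 30)),
    SceneNode(2, "table", "a wooden table", (0, 50, 200, 60)),
]
rels = [(0, "on top of", 2), (1, "left of", 0)]
graph = build_query_driven_graph(dets, rels, "the red cup on the table")
prompt = graph.to_text()  # this text, plus the query, would be given to the LLM
print(prompt)
```

The point of the serialization step is that the LLM receives an explicit, human-readable account of objects and their spatial relations, so its chosen object ID comes with a traceable chain of evidence rather than an opaque similarity score.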
Related Articles
I Extended the Trending mcp-brasil Project with AI Generation — Full Tutorial
Dev.to
The Rise of Self-Evolving AI: From Stanford Theory to Google AlphaEvolve and Berkeley OpenSage
Dev.to
Neural Networks in Mobile Robot Motion
Dev.to
Retraining vs Fine-tuning or Transfer Learning? [D]
Reddit r/MachineLearning