Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

arXiv cs.CV / 3/27/2026


Key Points

  • The paper introduces SGREC, an interpretable zero-shot referring expression comprehension method that identifies target objects in images using natural-language queries without task-specific training data.
  • Instead of relying solely on feature-similarity matching (as many VLM-based approaches do), SGREC builds a query-driven scene graph that encodes spatial relationships, descriptive captions, and object interactions relevant to the query.
  • The method then uses an LLM to infer the target object from the structured textual representation of the scene graph, providing detailed explanations that make its decisions interpretable.
  • Experiments report strong zero-shot performance on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%).
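The feature-similarity matching that the paper contrasts itself against can be sketched in a few lines: embed the query and each candidate region, then pick the region with the highest cosine similarity. This is a minimal, stdlib-only illustration of that baseline (the embeddings below are mock vectors; a real system would obtain them from a VLM such as CLIP), not SGREC itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pick_region_by_similarity(query_emb, region_embs):
    """Return the index of the region embedding most similar to the query."""
    scores = [cosine(query_emb, r) for r in region_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Mock example: the second region points almost exactly along the query direction.
query = [1.0, 0.0, 0.0]
regions = [[0.2, 0.9, 0.1],
           [0.9, 0.1, 0.0],
           [0.1, 0.2, 0.9]]
print(pick_region_by_similarity(query, regions))  # -> 1
```

As the Key Points note, scoring regions this way captures coarse semantic overlap but has no explicit handle on spatial relations ("the mug left of the laptop"), which is the gap the query-driven scene graph is meant to fill.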

Abstract

Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application to REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. This scene graph bridges the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure an interpretable inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.
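The pipeline's core idea, turning a scene graph into a textual representation that an LLM can reason over, can be sketched roughly as follows. All class names, fields, and the prompt layout here are hypothetical illustrations of the "structured intermediary" idea; the paper's actual graph format and prompting scheme are not specified in this summary.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    obj_id: int
    caption: str  # descriptive caption from the VLM, e.g. "a red mug"

@dataclass
class SceneGraph:
    """Hypothetical query-driven scene graph: captioned objects plus
    (subject_id, predicate, object_id) relation triples."""
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def to_prompt(self, query: str) -> str:
        """Serialize the graph into a textual prompt for an LLM."""
        lines = [f"Query: {query}", "Objects:"]
        lines += [f"  [{o.obj_id}] {o.caption}" for o in self.objects]
        lines.append("Relations:")
        lines += [f"  [{s}] {pred} [{t}]" for s, pred, t in self.relations]
        lines.append("Which object id does the query refer to? Explain why.")
        return "\n".join(lines)

# Toy graph: two objects with one spatial relation between them.
graph = SceneGraph(
    objects=[SceneObject(0, "a red mug"), SceneObject(1, "a silver laptop")],
    relations=[(0, "left of", 1)],
)
print(graph.to_prompt("the mug left of the laptop"))
```

The LLM's answer over this text would name an object id and justify it (e.g. by citing the "left of" relation), which is where the interpretability claim comes from: the evidence for the decision is explicit in the serialized graph rather than hidden in an embedding-similarity score.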