CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

arXiv cs.CV · April 27, 2026

📰 News · Models & Research

Key Points

  • The paper addresses a key reliability problem in open-vocabulary scene graph generation: relation predictions can be biased by language priors or object co-occurrence rather than grounded visual evidence.
  • It introduces CAGE-SGG, an evidence-grounded framework that verifies candidate relations through counterfactual relation verification instead of directly accepting language-plausible proposals.
  • The method generates open-vocabulary relation candidates with a vision-language proposer, decomposes predicate phrases into soft evidence bases (e.g., support, contact, containment, depth, motion, state), and uses a relation-conditioned evidence encoder to extract predicate-relevant cues.
  • A counterfactual verifier checks whether the relation score drops when necessary evidence is removed and stays stable under irrelevant perturbations, improving grounding reliability.
  • Experiments across multiple SGG benchmarks show consistent gains in recall metrics, unseen-predicate generalization, and counterfactual grounding quality, arguing that “relation verification” is more reliable and interpretable than “relation generation.”
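The counterfactual check in the fourth point can be sketched as a simple accept/reject test: score a candidate relation with full evidence, with its necessary evidence ablated, and with irrelevant evidence perturbed. This is an illustrative sketch only; the function and threshold names (`score_fn`, `drop_thresh`, `stability_thresh`) are assumptions, not the paper's API.

```python
def counterfactual_verify(score_fn, evidence, necessary_keys, irrelevant_keys,
                          drop_thresh=0.3, stability_thresh=0.05):
    """Accept a candidate relation only if its score drops when necessary
    evidence is removed AND stays stable under irrelevant perturbations.

    evidence: dict mapping evidence-basis names (e.g. 'contact', 'motion')
    to cue strengths in [0, 1]; score_fn maps such a dict to a scalar score.
    All names and thresholds here are hypothetical."""
    base = score_fn(evidence)

    # Ablate the predicate's necessary evidence (e.g. 'contact' for "on").
    ablated = {k: (0.0 if k in necessary_keys else v) for k, v in evidence.items()}
    drop = base - score_fn(ablated)

    # Perturb evidence the predicate should not depend on.
    perturbed = {k: (v * 0.5 if k in irrelevant_keys else v)
                 for k, v in evidence.items()}
    shift = abs(base - score_fn(perturbed))

    return drop >= drop_thresh and shift <= stability_thresh
```

A relation scored purely by a language prior (constant in the evidence) would show no drop under ablation and be rejected, which is exactly the failure mode the paper targets.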

Abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen-predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
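The abstract's decomposition of predicate phrases into soft evidence bases can be illustrated with a toy scorer: each predicate carries soft weights over the bases (support, contact, containment, depth, motion, state), and a candidate relation is scored by the weighted agreement between those weights and per-basis cues. The specific weights and the `relation_score` function below are hypothetical placeholders, not the paper's learned decomposition.

```python
# The six evidence bases named in the abstract.
EVIDENCE_BASES = ("support", "contact", "containment", "depth", "motion", "state")

# Hypothetical soft decompositions of predicate phrases; in the paper these
# would be produced by the framework, not hand-written.
PREDICATE_BASES = {
    "sitting on": {"support": 0.5, "contact": 0.4, "depth": 0.1},
    "inside":     {"containment": 0.7, "depth": 0.3},
}

def relation_score(predicate, cues):
    """Score a candidate relation as the weighted sum of predicate-relevant
    evidence cues. cues: per-basis strengths in [0, 1] extracted for a
    subject-object pair (here, just numbers in a dict)."""
    weights = PREDICATE_BASES[predicate]
    return sum(w * cues.get(basis, 0.0) for basis, w in weights.items())
```

Because the weights are soft rather than one-hot, a predicate like "sitting on" can require both support and contact cues at once, which is what makes the later counterfactual ablation of "necessary evidence" well-defined.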