CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

arXiv cs.CV · April 27, 2026

📰 News · Models & Research

Key Points

  • The paper addresses a key reliability problem in open-vocabulary scene graph generation: relation predictions can be biased by language priors or object co-occurrence rather than grounded visual evidence.
  • It introduces CAGE-SGG, an evidence-grounded framework that verifies candidate relations through counterfactual relation verification instead of directly accepting language-plausible proposals.
  • The method generates open-vocabulary relation candidates with a vision-language proposer, decomposes predicate phrases into soft evidence bases (e.g., support, contact, containment, depth, motion, state), and uses a relation-conditioned evidence encoder to extract predicate-relevant cues.
  • A counterfactual verifier checks whether the relation score drops when necessary evidence is removed and stays stable under irrelevant perturbations, improving grounding reliability.
  • Experiments across multiple SGG benchmarks show consistent gains in recall metrics, unseen-predicate generalization, and counterfactual grounding quality, arguing that “relation verification” is more reliable and interpretable than “relation generation.”
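The counterfactual check in the fourth point can be sketched as a simple accept/reject test: score a candidate relation with full evidence, with its necessary evidence ablated, and with irrelevant evidence perturbed. This is an illustrative sketch only; the function and threshold names (`score_fn`, `drop_thresh`, `stability_thresh`) are assumptions, not the paper's API.

```python
def counterfactual_verify(score_fn, evidence, necessary_keys, irrelevant_keys,
                          drop_thresh=0.3, stability_thresh=0.05):
    """Accept a candidate relation only if its score drops when necessary
    evidence is removed AND stays stable under irrelevant perturbations.

    evidence: dict mapping evidence-basis names (e.g. 'contact', 'motion')
    to cue strengths in [0, 1]; score_fn maps such a dict to a scalar score.
    All names and thresholds here are hypothetical."""
    base = score_fn(evidence)

    # Ablate the predicate's necessary evidence (e.g. 'contact' for "on").
    ablated = {k: (0.0 if k in necessary_keys else v) for k, v in evidence.items()}
    drop = base - score_fn(ablated)

    # Perturb evidence the predicate should not depend on.
    perturbed = {k: (v * 0.5 if k in irrelevant_keys else v)
                 for k, v in evidence.items()}
    shift = abs(base - score_fn(perturbed))

    return drop >= drop_thresh and shift <= stability_thresh
```

A relation scored purely by a language prior (constant in the evidence) would show no drop under ablation and be rejected, which is exactly the failure mode the paper targets.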

Abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen-predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
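The abstract's decomposition of predicate phrases into soft evidence bases can be illustrated with a toy scorer: each predicate carries soft weights over the bases (support, contact, containment, depth, motion, state), and a candidate relation is scored by the weighted agreement between those weights and per-basis cues. The specific weights and the `relation_score` function below are hypothetical placeholders, not the paper's learned decomposition.

```python
# The six evidence bases named in the abstract.
EVIDENCE_BASES = ("support", "contact", "containment", "depth", "motion", "state")

# Hypothetical soft decompositions of predicate phrases; in the paper these
# would be produced by the framework, not hand-written.
PREDICATE_BASES = {
    "sitting on": {"support": 0.5, "contact": 0.4, "depth": 0.1},
    "inside":     {"containment": 0.7, "depth": 0.3},
}

def relation_score(predicate, cues):
    """Score a candidate relation as the weighted sum of predicate-relevant
    evidence cues. cues: per-basis strengths in [0, 1] extracted for a
    subject-object pair (here, just numbers in a dict)."""
    weights = PREDICATE_BASES[predicate]
    return sum(w * cues.get(basis, 0.0) for basis, w in weights.items())
```

Because the weights are soft rather than one-hot, a predicate like "sitting on" can require both support and contact cues at once, which is what makes the later counterfactual ablation of "necessary evidence" well-defined.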