CREG: Compass Relational Evidence for Interpreting Spatial Reasoning in Vision-Language Models

arXiv cs.CV / 3/24/2026


Key Points

  • The paper introduces CREG (Compass Relational Evidence Graph), a training-free interpretability method that maps multi-layer contrastive Grad×Act attributions into a reference-centered polar (compass-sector) coordinate system to identify inferred directional relations in vision-language models.
  • It evaluates directional explanations using three new metrics—Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS)—to measure how well directional evidence matches the intended geometry and whether it is causally faithful.
  • Experiments on Qwen2-VL-7B show consistent improvements over standard attribution baselines, including a 16.1° reduction in angular error versus attention rollout and a +0.120 improvement in EA on COCO-Pairs.
  • The causal occlusion tests on 540 samples yield COS values ≥ +0.42, supporting the faithfulness of the directional explanations.
  • Results are weaker on Qwen2-VL-2B, suggesting CREG benefits from more structured spatial representations that become clearer at larger model scales.

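The core projection described above can be illustrated with a minimal sketch. This is not the paper's implementation: the attribution map here is a generic non-negative 2D array standing in for CREG's multi-layer contrastive Grad×Act maps, the reference-object center is assumed given, and the choice of 8 compass sectors is illustrative.

```python
import numpy as np

def compass_evidence(attr, ref_center, n_sectors=8):
    """Project a 2D attribution map into a directional evidence
    distribution over compass sectors around a reference center.

    attr:       (H, W) non-negative attribution map (illustrative stand-in
                for multi-layer contrastive Grad×Act attributions)
    ref_center: (row, col) of the reference object's center
    """
    H, W = attr.shape
    rows, cols = np.mgrid[0:H, 0:W]
    # Angle of each pixel relative to the reference center.
    # Image rows grow downward, so negate the row offset to make "up" positive.
    dy = -(rows - ref_center[0])
    dx = cols - ref_center[1]
    angles = np.arctan2(dy, dx)  # radians in (-pi, pi]
    # Bin every pixel's angle into a compass sector and accumulate its mass.
    sector_width = 2 * np.pi / n_sectors
    sectors = ((angles + np.pi) // sector_width).astype(int) % n_sectors
    evidence = np.zeros(n_sectors)
    np.add.at(evidence, sectors.ravel(), attr.ravel())
    total = evidence.sum()
    return evidence / total if total > 0 else evidence
```

If most attribution mass lies to the upper right of the reference center, the sector covering that bearing dominates the returned distribution; the argmax sector can then serve as the model's inferred direction for the relation.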
Abstract

Vision-language models (VLMs) perform strongly on spatial reasoning benchmarks, yet how they encode directional relations remains poorly understood. Existing attribution methods such as GradCAM and attention rollout reveal where a model attends, but not what direction it infers between objects. We introduce CREG (Compass Relational Evidence Graph), a training-free interpretability framework that projects multi-layer contrastive Grad×Act attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors. To evaluate directional explanations, we propose three metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS). On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines; on COCO-Pairs, prediction-targeted CREG achieves a DAE of 55.5° and an EA of 0.553, improving over attention rollout by 16.1° in angular error and 0.120 in EA. Causal occlusion experiments on 540 samples across both datasets further support the faithfulness of these directional explanations, with COS ≥ +0.42. The gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations that emerge at larger scales. Overall, our results show that contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning.
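The two geometric metrics can be sketched as follows. The exact formulations in the paper may differ; this assumes DAE is the circular angular distance (in degrees) between the predicted and ground-truth directions, and EA is the fraction of samples whose predicted compass sector matches the ground-truth sector.

```python
def direction_alignment_error(pred_deg, gt_deg):
    """Circular angular distance between two bearings, in [0, 180] degrees."""
    diff = abs(pred_deg - gt_deg) % 360
    return min(diff, 360 - diff)

def edge_accuracy(pred_sectors, gt_sectors):
    """Fraction of relation edges whose predicted sector matches ground truth."""
    hits = sum(p == g for p, g in zip(pred_sectors, gt_sectors))
    return hits / len(pred_sectors)
```

Under these definitions, the reported COCO-Pairs numbers read directly: a DAE of 55.5° means the evidence-derived bearing is off by 55.5° on average, and an EA of 0.553 means just over half of the directional edges land in the correct sector.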