Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

arXiv cs.CL, March 31, 2026


Key Points

  • The paper targets a common failure mode of vision-language models: compositional reasoning when captions use the same words but differ in relational structure.
  • It evaluates and augments four diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark, using both plain and scene-graph-augmented settings.
  • The proposed TextSceneGraphParser extracts dependency-based subject–relation–object triples with spaCy, and the Graph Asymmetry Scorer uses optimal bipartite matching to inject structural relational priors at inference time.
  • Caption ablations (masking/swapping subject and object) indicate that Qwen3-VL-8B-Thinking reaches a group score of 62.75, with multi-turn scene-graph filtering further improving it to 66.0 and surpassing prior open-source results.
  • The authors analyze augmentation tradeoffs, finding that scene-graph augmentation helps already-strong models while offering negligible or even negative gains for weaker baselines.
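To make the pipeline concrete, here is a minimal sketch of the subject–relation–object extraction that a component like TextSceneGraphParser performs. The paper uses spaCy's dependency parser; to stay self-contained, this sketch assumes the parse has already been flattened into `(text, dep_label, head_index)` tuples (the information spaCy exposes via `token.dep_` and `token.head.i`). The function name and dependency-label patterns are illustrative assumptions, not the paper's actual implementation.

```python
def extract_triples(tokens):
    """Extract (subject, relation, object) triples from a dependency parse.

    `tokens` is a list of (text, dep_label, head_index) tuples -- the kind of
    structure a spaCy Doc provides via token.dep_ and token.head.i.
    The label set below (nsubj/dobj etc.) is a common but simplified pattern.
    """
    triples = []
    for i, (text, dep, head) in enumerate(tokens):
        # Treat root and relative-clause verbs as relation candidates.
        if dep in ("ROOT", "relcl"):
            subj = next((t for t, d, h in tokens
                         if d in ("nsubj", "nsubjpass") and h == i), None)
            obj = next((t for t, d, h in tokens
                        if d in ("dobj", "obj", "attr") and h == i), None)
            if subj and obj:
                triples.append((subj, text, obj))
    return triples

# "the dog chases the cat" -> [("dog", "chases", "cat")]
parse = [("the", "det", 1), ("dog", "nsubj", 2), ("chases", "ROOT", 2),
         ("the", "det", 4), ("cat", "dobj", 2)]
print(extract_triples(parse))
```

Note how the Winoground-style swapped caption "the cat chases the dog" would yield the reversed triple `("cat", "chases", "dog")`: identical words, different structure, which is exactly the asymmetry the scorer then measures.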

Abstract

Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning: distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework benchmarking four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph-augmented regimes. We introduce a dependency-based TextSceneGraphParser (built on spaCy) that extracts subject-relation-object triples, and a Graph Asymmetry Scorer that uses optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn SG filtering strategy further lifts it to 66.0, surpassing the prior open-source state of the art. We analyze the capability-augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: https://github.com/amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding
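
The Graph Asymmetry Scorer's core idea (matching one caption's triples against another's and charging for per-slot mismatches) can be sketched as follows. The paper says "optimal bipartite matching"; this sketch finds the optimum by brute force over permutations, which is fine for the handful of triples a caption yields (a Hungarian solver would replace it at scale). Function names, the unit mismatch cost, and the unmatched-triple penalty are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import permutations

def triple_cost(t1, t2):
    """Mismatch count between two (subject, relation, object) triples,
    comparing each slot independently."""
    return sum(a != b for a, b in zip(t1, t2))

def asymmetry_score(triples_a, triples_b):
    """Minimum total mismatch cost over all bipartite matchings of the two
    triple sets, plus a unit penalty per unmatched triple. 0 means the
    captions encode identical structure; higher means more asymmetric."""
    if not triples_a or not triples_b:
        return max(len(triples_a), len(triples_b))
    small, large = sorted([triples_a, triples_b], key=len)
    best = min(
        sum(triple_cost(s, t) for s, t in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best + (len(large) - len(small))

# Swapped Winoground-style captions share all words but differ in
# subject/object slots, so two slots mismatch:
print(asymmetry_score([("dog", "chases", "cat")],
                      [("cat", "chases", "dog")]))  # -> 2
```

A score of 0 for identical structures and a positive score for word-preserving swaps is what lets this signal act as a structural prior at inference time: it separates exactly the caption pairs that bag-of-words similarity cannot.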