Inference-Time Structural Reasoning for Compositional Vision-Language Understanding
arXiv cs.CL / 3/31/2026
Key Points
- The paper targets a common failure mode of vision-language models: compositional reasoning breaks down when captions use the same words but differ in relational structure.
- It evaluates and augments four diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark, using both plain and scene-graph-augmented settings.
- The proposed TextSceneGraphParser extracts dependency-based subject–relation–object triples with spaCy, and the Graph Asymmetry Scorer uses optimal bipartite matching to inject structural relational priors at inference time (a minimal sketch follows this list).
- Caption ablations (masking or swapping the subject and object) indicate that Qwen3-VL-8B-Thinking reaches a group score of 62.75, with multi-turn scene-graph filtering further improving it to 66.0 and surpassing prior open-source results (the group-score metric and a simple swap ablation are sketched after this list).
- The authors analyze augmentation tradeoffs, finding that scene-graph augmentation helps already-strong models while offering negligible or even negative gains for weaker baselines.
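
Below is a minimal sketch of how the two components described above could work, assuming a spaCy dependency parse for triple extraction and SciPy's Hungarian-algorithm solver for the optimal bipartite matching. The function names and the lexical-overlap similarity are illustrative stand-ins, not the paper's actual TextSceneGraphParser or Graph Asymmetry Scorer.

```python
import spacy
import numpy as np
from scipy.optimize import linear_sum_assignment

nlp = spacy.load("en_core_web_sm")

def extract_triples(caption: str) -> list[tuple[str, str, str]]:
    """Pull (subject, relation, object) triples from a dependency parse."""
    doc = nlp(caption)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            # Follow prepositions one level down, e.g. "sits on the mat".
            for prep in (c for c in token.children if c.dep_ == "prep"):
                objects += [c for c in prep.children if c.dep_ == "pobj"]
            for s in subjects:
                for o in objects:
                    triples.append((s.lemma_, token.lemma_, o.lemma_))
    return triples

def triple_similarity(t1, t2) -> float:
    """Crude slot-wise lexical overlap (placeholder for a learned similarity)."""
    return sum(a == b for a, b in zip(t1, t2)) / 3.0

def asymmetry_score(caption_a: str, caption_b: str) -> float:
    """Match triples across the two captions with optimal bipartite matching and
    return the mean matching cost: higher means the captions share words but
    differ more in relational structure."""
    ta, tb = extract_triples(caption_a), extract_triples(caption_b)
    if not ta or not tb:
        return 0.0
    cost = np.array([[1.0 - triple_similarity(x, y) for y in tb] for x in ta])
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

if __name__ == "__main__":
    # Winoground-style pair: same words, swapped relational structure.
    print(asymmetry_score("the dog chases the cat", "the cat chases the dog"))
```

On the example pair, the two triples share only the relation slot, so the asymmetry score is about 0.67; identical captions would score 0.0.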
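
The 62.75 and 66.0 figures refer to the standard Winoground group score, which requires an example's scores to be consistent in both the text and image directions. The swap helper below is one plausible form of the subject/object ablation, reusing a spaCy parse as above; the paper's exact masking and swapping procedure may differ, so treat it as an assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def group_score(s_c0_i0: float, s_c0_i1: float, s_c1_i0: float, s_c1_i1: float) -> int:
    """Winoground group score for one example: each caption must score higher
    with its own image (text direction) and each image must score higher with
    its own caption (image direction). The benchmark reports the mean over examples."""
    text_ok = s_c0_i0 > s_c1_i0 and s_c1_i1 > s_c0_i1
    image_ok = s_c0_i0 > s_c0_i1 and s_c1_i1 > s_c1_i0
    return int(text_ok and image_ok)

def swap_subject_object(caption: str) -> str:
    """Hypothetical caption ablation: swap the head tokens of the main verb's
    subject and object, e.g. 'the dog chases the cat' -> 'the cat chases the dog'."""
    doc = nlp(caption)
    subj = next((t for t in doc if t.dep_ in ("nsubj", "nsubjpass")), None)
    obj = next((t for t in doc if t.dep_ in ("dobj", "obj")), None)
    if subj is None or obj is None:
        return caption  # no clean subject/object pair to swap
    words = [t.text for t in doc]
    words[subj.i], words[obj.i] = words[obj.i], words[subj.i]
    return " ".join(words)
```

Re-scoring a swapped caption against both images is one way to check whether a model's preference rests on relational structure: a purely bag-of-words matcher would rate the swapped caption about the same as the original.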