CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
arXiv cs.AI / 4/27/2026
Key Points
- The paper introduces Compositional Grounded Contrast (CGC) to improve multimodal LLMs’ fine-grained multi-image understanding, addressing issues like spatial hallucination, attention leakage, and object non-constancy.
- CGC is designed as a low-cost framework that builds compositional multi-image training instances from existing single-image grounding annotations using Inter-Image and Intra-Image contrastive learning.
- It adds a rule-based spatial reward integrated into the GRPO framework to strengthen source-image attribution, spatial alignment, and the validity of structured outputs under a Think-before-Grounding strategy.
- Experiments report state-of-the-art performance on fine-grained multi-image benchmarks (MIG-Bench, VLM2-Bench) and transferable gains on broader multimodal reasoning tasks, with consistent improvements over the Qwen3-VL-8B base model.
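The rule-based spatial reward can be pictured as a few deterministic checks layered on a model's structured grounding output. The sketch below is illustrative only, not the paper's exact formulation: the field names (`valid_format`, `image_id`, `box`), the per-check reward values, and the 0.5 IoU threshold are all assumptions, but it captures the three signals the summary names — structured-output validity, source-image attribution, and spatial alignment.

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def spatial_reward(pred, gt, iou_thresh=0.5):
    """Hypothetical rule-based reward: one point each for a
    well-formed output, the correct source image, and a box
    that overlaps the ground truth above a threshold."""
    r = 0.0
    if not pred.get("valid_format"):   # malformed structured output
        return r                       # earns nothing further
    r += 1.0
    if pred.get("image_id") == gt["image_id"]:  # source-image attribution
        r += 1.0
        if iou(pred["box"], gt["box"]) >= iou_thresh:  # spatial alignment
            r += 1.0
    return r
```

In a GRPO-style setup, such a scalar would be computed per sampled completion and used to form group-relative advantages; gating the attribution and alignment checks behind format validity discourages the model from skipping the Think-before-Grounding output structure.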