CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

arXiv cs.LG / 4/3/2026


Key Points

  • The paper introduces CRIT, a new dataset and benchmark designed to better evaluate cross-modal multi-hop reasoning by constructing tasks that require connecting textual context with visual evidence over multiple steps.
  • It argues existing multimodal benchmarks and training data often under-enforce complementary, multi-hop reasoning because they rely too heavily on single-modality cues or weak interleaving of image-text information.
  • CRIT is generated via a graph-based automatic pipeline (see the sketch after this list), covers diverse domains (including natural images, videos, and text-rich sources), and provides a manually verified test set to support more reliable evaluation.
  • Experimental results indicate that even state-of-the-art vision-language models perform poorly on CRIT-style reasoning tasks, highlighting a gap in current model capabilities.
  • Training on CRIT leads to significant improvements in cross-modal multi-hop reasoning and yields gains not only on SPIQA but also on other standard multimodal benchmarks.
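
The summary does not detail the pipeline's internals, but a graph-based synthesis loop of this kind is typically built by linking evidence nodes through shared entities and sampling paths that alternate modalities, so that no single modality suffices to answer. Below is a minimal, hypothetical sketch: `EvidenceNode`, the `modality` field, `bridges`, and `synthesize_question` are illustrative names, not the authors' API.

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class EvidenceNode:
    """One piece of evidence: a fact tied to a modality (hypothetical schema)."""
    entity: str    # entity the fact is about
    fact: str      # natural-language statement of the fact
    modality: str  # "text", "image", or "video"

def build_graph(nodes):
    """Index evidence nodes by entity; nodes sharing an entity are linked."""
    graph = {}
    for node in nodes:
        graph.setdefault(node.entity, []).append(node)
    return graph

def sample_cross_modal_path(graph, bridges, hops=3, rng=random):
    """Walk entity-to-entity bridges, forcing the modality to change at
    each hop so no single modality can answer the resulting question."""
    entity = rng.choice(list(graph))
    path = [rng.choice(graph[entity])]
    for _ in range(hops - 1):
        candidates = [n for e in bridges.get(entity, [])
                      for n in graph.get(e, [])
                      if n.modality != path[-1].modality]
        if not candidates:
            break
        node = rng.choice(candidates)
        path.append(node)
        entity = node.entity
    return path

def synthesize_question(path):
    """Turn a sampled path into a (question, evidence chain) pair."""
    chain = " -> ".join(f"[{n.modality}] {n.fact}" for n in path)
    question = (f"Combining the evidence below, what links "
                f"'{path[0].entity}' to '{path[-1].entity}'?")
    return question, chain

# Toy usage with made-up evidence:
nodes = [
    EvidenceNode("red kite", "A red kite is perched on a fence post.", "image"),
    EvidenceNode("red kite", "Red kites were reintroduced to the Chilterns.", "text"),
    EvidenceNode("Chilterns", "The clip pans across the Chiltern hills.", "video"),
]
bridges = {"red kite": ["Chilterns"], "Chilterns": ["red kite"]}
q, evidence = synthesize_question(
    sample_cross_modal_path(build_graph(nodes), bridges, hops=3))
print(q)
print(evidence)
```

The modality-alternation constraint in the sampler is the key design choice here: it is one plausible way to operationalize the paper's goal of questions whose answers cannot be inferred from a single modality alone.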

Abstract

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on a single image or a set of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT covers diverse domains, including natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.