Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

arXiv cs.CL / 5/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that cross-modal reasoning in multimodal LLMs remains poorly understood because prior studies lack controlled evaluations and analyses of model internals that would explain when extra modalities help or hurt.
  • It proposes a logic-grounded evaluation framework that classifies multimodal reasoning into six interaction patterns based on how facts are distributed across modalities and combined logically.
  • Empirical results show adding modalities improves reasoning only when they offer independent and sufficient reasoning pathways, while redundant or chained entailment tends to degrade performance.
  • The authors identify two core bottlenecks: task composition (recognition and reasoning cannot be executed jointly in a single pass) and fusion (early integration introduces bias). They show that a simple two-step prompting scheme, recognize then reason, mitigates the task-composition issue (see the sketch after this list).
  • They also find that early fusion biases attention toward particular modalities, and that softening attention in the early layers improves reasoning, suggesting fusion control and composition-aware training as promising directions.
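
To make the two-step scheme concrete, here is a minimal sketch of recognize-then-reason prompting as two separate model calls: the first only verbalizes the facts carried by each modality, the second reasons over those verbalized facts. `mllm_generate` is a placeholder for any chat-style MLLM API, and the prompt wording is illustrative, not the authors' exact prompts.

```python
# Minimal sketch of two-step (recognize-then-reason) prompting.
# `mllm_generate` stands in for any chat-style MLLM API call;
# the prompt wording is illustrative, not the paper's exact prompts.

def mllm_generate(prompt: str, images: list | None = None) -> str:
    """Placeholder: send a (multimodal) prompt to an MLLM, return its text."""
    raise NotImplementedError

def two_step_answer(question: str, images: list) -> str:
    # Step 1: recognition only -- have the model extract the facts each
    # modality contains, without asking it to reason over them yet.
    facts = mllm_generate(
        "List the facts relevant to the question that appear in the "
        f"image(s) and in the text. Question: {question}",
        images=images,
    )
    # Step 2: reasoning only -- answer from the verbalized facts, so that
    # recognition and logical composition happen in separate passes.
    return mllm_generate(
        f"Facts:\n{facts}\n\nUsing only these facts, answer: {question}"
    )
```

Separating the passes is exactly what works around the task-composition bottleneck: the model never has to perceive and compose in the same forward pass.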

Abstract

Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and of analyses of model internals that would isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths; redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. From this, we identify two core failures: a task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and a fusion bottleneck, where early integration introduces bias. Probing further, we find that attention patterns fail to encode fact usefulness, yet a simple two-step prompt (recognize, then reason) restores performance, confirming the task-composition bottleneck. Likewise, modality identity remains recoverable in early layers, and softening attention during early fusion improves reasoning, highlighting biased fusion as a second failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early-fusion control as promising directions.
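
To illustrate what "softening attention in early fusion" could mean in practice, here is a minimal PyTorch sketch that raises the softmax temperature in the first few layers of scaled dot-product attention, flattening the distribution so the model commits less aggressively to one modality early on. The layer cutoff (`early_layers`) and `temperature` value are hypothetical knobs for illustration, not settings from the paper.

```python
import torch
import torch.nn.functional as F

def soft_early_attention(q, k, v, layer_idx: int,
                         early_layers: int = 4, temperature: float = 2.0):
    """Scaled dot-product attention with a raised softmax temperature in
    early layers. One plausible reading of "softening early fusion":
    a flatter attention distribution early on avoids prematurely locking
    onto a single modality. `early_layers` and `temperature` are
    hypothetical knobs, not the paper's settings.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # standard scaling
    if layer_idx < early_layers:
        scores = scores / temperature  # flatten early-layer attention
    return F.softmax(scores, dim=-1) @ v
```

An intervention like this is inference-time only; the paper's broader suggestion is that training with composition awareness and explicit fusion control could address the same bias more directly.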