Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

arXiv cs.CV · April 17, 2026

📰 News · Models & Research

Key Points

  • The paper notes a common multimodal performance paradox where unimodal baselines can outperform multimodal joint inference in omni-modal LLMs.
  • It attributes this fragility to “static fusion” architectures, highlighting two structural issues: positional bias in sequential inputs and alignment traps in interleaved formats, both of which distort attention regardless of task semantics.
  • It proposes Chain of Modality (CoM), an agentic framework that replaces passive concatenation with dynamic orchestration of fusion topologies.
  • CoM adaptively switches among parallel, sequential, and interleaved pathways and splits cognition into “Direct-Decide” and “Reason-Decide” routes for faster perception and auditable reasoning.
  • The approach reportedly works under training-free or data-efficient supervised fine-tuning (SFT) and yields more robust, consistent generalization across benchmarks.
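The dynamic-orchestration idea in the points above can be sketched as a lightweight router that picks a fusion topology and a cognitive route per task. This is an illustrative assumption only: the class names (`Task`, `orchestrate`), the routing policy, and all fields are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class Topology(Enum):
    PARALLEL = "parallel"        # modalities encoded independently, fused late
    SEQUENTIAL = "sequential"    # modalities concatenated in a fixed order
    INTERLEAVED = "interleaved"  # modality tokens interleaved by alignment

class Route(Enum):
    DIRECT_DECIDE = "direct-decide"  # fast path: answer from direct perception
    REASON_DECIDE = "reason-decide"  # slow path: explicit, auditable reasoning

@dataclass
class Task:
    modalities: tuple              # e.g. ("image", "audio", "text")
    needs_temporal_alignment: bool  # streams must stay time-aligned
    requires_reasoning: bool        # analytical / multi-step question

def orchestrate(task: Task):
    """Choose a fusion topology and cognitive route for one task.

    Illustrative policy (not the paper's): interleave only when streams
    must stay time-aligned (sidestepping alignment traps elsewhere),
    fuse independent modalities in parallel (avoiding positional bias),
    and fall back to a plain sequential input for a single modality.
    """
    if len(task.modalities) <= 1:
        topology = Topology.SEQUENTIAL
    elif task.needs_temporal_alignment:
        topology = Topology.INTERLEAVED
    else:
        topology = Topology.PARALLEL
    route = Route.REASON_DECIDE if task.requires_reasoning else Route.DIRECT_DECIDE
    return topology, route
```

The point of the sketch is the control flow, not the heuristics: the topology decision is made per input rather than fixed by the architecture, which is what distinguishes dynamic orchestration from static fusion.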

Abstract

Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined “Direct-Decide” path for direct perception and a structured “Reason-Decide” path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.