Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

arXiv cs.CL / 4/8/2026


Key Points

  • The paper identifies cross-modal coreference (linking the same real-world referent across different modalities) as an overlooked blocker for reliable omni-modal reasoning in Omni-LLMs.
  • It formalizes the task as locating a referent in one modality and re-identifying it in another, and introduces the CrossOmni dataset with nine tasks and human-designed reasoning rationales.
  • Experiments across 13 Omni-LLMs show systematic weaknesses in cross-modal coreference, attributed to the lack of coreference-aware thinking patterns.
  • To improve alignment, the authors propose two strategies to induce coreference-aware reasoning: a training-free in-context learning approach and a training-based framework combining supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), both delivering substantial gains.
  • The improvements also generalize to collaborative reasoning tasks, positioning cross-modal coreference as a key missing component for robust omni-modal models.
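The task formalization above (locate a referent in a source modality, then re-identify it in a target modality) can be sketched as a minimal data structure with exact-match scoring. This is an illustrative reconstruction, not the paper's actual CrossOmni schema; all field names and the example items are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CoreferenceInstance:
    """One hypothetical cross-modal coreference item: a referent appears in a
    source modality and must be re-identified among target-modality candidates."""
    source_modality: str          # e.g. "audio"
    target_modality: str          # e.g. "image"
    source_context: str           # description/transcript of the source input
    target_candidates: list[str]  # candidate referents in the target modality
    gold_index: int               # index of the correct candidate

def score(predictions: list[int], instances: list[CoreferenceInstance]) -> float:
    """Exact-match accuracy over re-identification decisions."""
    correct = sum(p == inst.gold_index for p, inst in zip(predictions, instances))
    return correct / len(instances)

# Toy instances (invented for illustration, not drawn from CrossOmni).
items = [
    CoreferenceInstance("audio", "image", "a dog barks twice",
                        ["cat on sofa", "dog by door", "empty hallway"], 1),
    CoreferenceInstance("text", "video", "the speaker in red",
                        ["person in red jacket", "person in blue"], 0),
]
print(score([1, 0], items))  # 1.0
```

Framing each of the nine CrossOmni tasks as instances of this shape is what lets a single accuracy metric be reported across modality pairs.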

Abstract

Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multimodal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.
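The training-free In-Context Learning strategy mentioned in the abstract can be pictured as prepending an exemplar whose rationale spells out the two coreference steps before the model sees a new query. The exemplar text and prompt format below are invented for illustration; the paper's actual prompts are not reproduced here.

```python
# Hypothetical exemplar whose rationale demonstrates coreference-aware
# thinking: first locate the referent in the source modality, then
# re-identify it in the target modality.
EXEMPLAR = (
    "Question: Which object in the image does the barking sound refer to?\n"
    "Reasoning: Step 1, locate the referent in the source modality: the audio "
    "contains barking, so the referent is a dog. Step 2, re-identify it in the "
    "target modality: the image shows a dog by the door.\n"
    "Answer: the dog by the door"
)

def build_prompt(question: str) -> str:
    """Prepend the coreference-aware exemplar to a new query (assumed format)."""
    return f"{EXEMPLAR}\n\nQuestion: {question}\nReasoning:"

print(build_prompt("Which speaker in the video said the quoted sentence?"))
```

The training-based alternative pursues the same goal by baking this step-by-step pattern into the model weights (SFT on rationale-annotated data, then GRPO-style reinforcement) rather than supplying it at inference time.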