Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs
arXiv cs.CL / 4/8/2026
Key Points
- The paper identifies cross-modal coreference (linking the same real-world referent across different modalities) as an overlooked blocker for reliable omni-modal reasoning in Omni-LLMs.
- It formalizes the task as locating a referent in one modality and re-identifying it in another, and introduces the CrossOmni dataset, which covers nine tasks with human-designed reasoning rationales (a minimal schema sketch follows this list).
- Experiments across 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which the authors attribute to a lack of coreference-aware reasoning patterns.
- To improve alignment, the authors propose a training-free in-context learning approach and a training-based SFT+GRPO framework, both of which induce coreference-aware reasoning and deliver substantial gains (see the prompting sketch after this list).
- The improvements also generalize to collaborative reasoning tasks, positioning cross-modal coreference as a key missing component for robust omni-modal models.
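To make the task formalization concrete, here is a minimal, hypothetical schema for one coreference item: a referent is located via a mention in a source modality and must be re-identified among candidates in a target modality. The field names and the exact-match scoring rule are assumptions for illustration; the paper's actual CrossOmni format is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical schema for one cross-modal coreference item. A referent is
# located via a mention in the source modality and must be re-identified
# among candidate referents in the target modality.
@dataclass
class CoreferenceItem:
    source_modality: str    # e.g. "audio"
    target_modality: str    # e.g. "image"
    source_mention: str     # cue that locates the referent in the source modality
    candidates: list[str]   # candidate referents in the target modality
    answer_index: int       # index of the correct re-identification

def is_correct(item: CoreferenceItem, predicted_index: int) -> bool:
    """Re-identification is scored as exact match against the gold candidate."""
    return predicted_index == item.answer_index
```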
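And a sketch of how the training-free in-context approach might look in practice: prepend an exemplar whose rationale spells out the locate-then-re-identify step before posing the real question. `omni_llm_generate`, the exemplar wording, and the instruction text are all hypothetical stand-ins; the paper's actual prompts are not shown in this summary.

```python
# Sketch of a coreference-aware in-context prompt. The exemplar rationale
# makes the two-step pattern explicit: locate the referent in the source
# modality, then re-identify it in the target modality.
EXEMPLAR = (
    "Q: The speaker in the audio mentions 'the red car'. Which image region is it?\n"
    "Reasoning: First locate the referent in the audio ('the red car'), then "
    "re-identify it among the image candidates by matching color and object "
    "type. Region 2 shows a red car.\n"
    "A: Region 2\n"
)

def coreference_aware_prompt(question: str) -> str:
    """Wrap a question with an exemplar that demonstrates the coreference step."""
    return (
        "Answer by first locating the referent in the source modality, "
        "then re-identifying it in the target modality.\n\n"
        + EXEMPLAR
        + "\nQ: " + question + "\nReasoning:"
    )

# `omni_llm_generate` is a hypothetical Omni-LLM API; substitute your own client:
# response = omni_llm_generate(coreference_aware_prompt("..."), media=[audio, image])
```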