OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

arXiv cs.CV / 4/10/2026


Key Points

  • OmniJigsaw is a self-supervised framework for extending reinforcement-learning-based post-training to omni-modal (video, audio, etc.) models, using the chronological reconstruction of shuffled audio-visual clips as a proxy task.
  • To compel cross-modal integration, the method orchestrates how modalities are handled through three strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking.
  • Because the "puzzle quality" of the proxy task directly determines performance, the authors propose a two-stage coarse-to-fine data filtering pipeline that enables efficient adaptation to massive unannotated data.
  • The analysis identifies a "bi-modal shortcut phenomenon" in Joint Modality Integration and concludes that fine-grained clip-level modality masking mitigates it, outperforming sample-level modality selection.
  • Across 15 benchmarks, the method shows substantial improvements in video understanding, audio understanding, and collaborative reasoning, validating it as a scalable framework for self-supervised omni-modal learning.
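The puzzle construction with clip-level modality masking can be illustrated with a minimal sketch. The function below is a hypothetical illustration (the names `build_jigsaw_puzzle` and `mask_prob` are not from the paper): it shuffles temporally aligned audio-visual clips to form the reordering puzzle, and masks exactly one modality at some positions so that each position is still solvable from the remaining modality.

```python
import random

def build_jigsaw_puzzle(video_clips, audio_clips, mask_prob=0.3, seed=None):
    """Build a temporal-reordering puzzle from paired audio-visual clips.

    Hypothetical sketch: at each shuffled position, with probability
    `mask_prob` one of the two modalities is dropped (clip-level
    modality masking), forcing the model to rely on the other.
    """
    rng = random.Random(seed)
    n = len(video_clips)
    assert n == len(audio_clips), "clips must be temporally aligned"

    order = list(range(n))
    rng.shuffle(order)  # ground-truth permutation the model must recover

    puzzle = []
    for pos in order:
        v, a = video_clips[pos], audio_clips[pos]
        if rng.random() < mask_prob:
            # drop exactly one modality; the position stays solvable
            if rng.random() < 0.5:
                v = None   # visual masked -> model must use audio cues
            else:
                a = None   # audio masked -> model must use visual cues
        puzzle.append((v, a))
    return puzzle, order  # training target: predict `order` from `puzzle`
```

Masking at the clip level, rather than dropping a whole modality per sample, is what the paper credits with breaking the bi-modal shortcut: the model cannot commit to a single modality for the entire sequence.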

Abstract

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
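The two-stage coarse-to-fine filtering can be sketched generically. The abstract does not specify the scoring criteria, so the sketch below is a hypothetical skeleton: `coarse_score` and `fine_score` are assumed placeholders standing in for, e.g., a cheap heuristic check and an expensive model-based solvability score, with the cheap pass pruning most samples before the expensive pass runs.

```python
def coarse_to_fine_filter(samples, coarse_score, fine_score,
                          coarse_threshold=0.5, fine_threshold=0.8):
    """Two-stage filter: a cheap coarse pass prunes most samples,
    then a costlier fine pass scores only the survivors.

    `coarse_score` and `fine_score` are caller-supplied callables;
    the paper's actual criteria are not given here, so these are
    placeholders (e.g. heuristic temporal-dynamics checks for the
    coarse stage, model-based puzzle-quality scoring for the fine one).
    """
    survivors = [s for s in samples if coarse_score(s) >= coarse_threshold]
    return [s for s in survivors if fine_score(s) >= fine_threshold]
```

The design rationale is standard for large-scale curation: the expensive quality score is only computed on the small fraction of data that passes the cheap filter, which is what makes adaptation to massive unannotated corpora tractable.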