Context Unrolling in Omni Models

arXiv cs.CV / April 24, 2026


Key Points

  • The paper introduces Omni, a unified multimodal model trained natively across text, images, videos, 3D geometry, and internal (hidden) representations.
  • The authors argue that this training produces “Context Unrolling,” a mechanism in which the model explicitly reasons over multiple modal representations before generating outputs (a hypothetical sketch follows this list).
  • Omni is claimed to better aggregate complementary signals across heterogeneous modalities, improving how faithfully it approximates the shared multimodal knowledge space.
  • The model reportedly achieves strong results on multimodal generation and understanding benchmarks, with demonstrated capabilities for generating text, images, videos, and 3D geometry in-context.
  • Overall, the work positions Context Unrolling as a pathway to higher downstream reasoning fidelity for multimodal systems.
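The article does not describe the mechanism beyond the abstract, so the following is only a minimal PyTorch sketch of one plausible reading of Context Unrolling: each modality is projected into a shared token space, and the model iterates over the concatenated multimodal context before any outputs are decoded. Every name, dimension, and module here (`ContextUnrollingSketch`, the per-modality encoders, `n_reasoning_steps`) is a hypothetical illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContextUnrollingSketch(nn.Module):
    """Hypothetical sketch of 'Context Unrolling': project every modality
    into a shared token space, then run reasoning passes over the joint
    context before any output tokens are decoded. All module names and
    feature sizes are assumptions, not the paper's published interface."""

    def __init__(self, d_model: int = 1024, n_reasoning_steps: int = 4):
        super().__init__()
        # One encoder per modality, each mapping raw features to shared tokens.
        # Input feature sizes (512, 768, 256) are placeholders.
        self.encoders = nn.ModuleDict({
            "text":  nn.Linear(512, d_model),
            "image": nn.Linear(768, d_model),
            "video": nn.Linear(768, d_model),
            "geom":  nn.Linear(256, d_model),  # 3D geometry features
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
        self.n_reasoning_steps = n_reasoning_steps

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # "Unroll" the context: embed each modality and concatenate along
        # the sequence axis so tokens from all modalities attend jointly.
        context = torch.cat(
            [self.encoders[name](feats) for name, feats in inputs.items()],
            dim=1,
        )
        # Iteratively refine the shared context before decoding, letting
        # complementary signals aggregate across heterogeneous modalities.
        for _ in range(self.n_reasoning_steps):
            context = self.reasoner(context)
        return context  # a decoder head would generate outputs from this
```

Concatenating all modality tokens into a single sequence is just one way to realize "reasoning across multiple modal representations"; the actual model may instead interleave reasoning tokens autoregressively or operate over hidden representations directly, as the abstract hints.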

Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.