Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

arXiv cs.AI / 4/2/2026


Key Points

  • Dynin-Omni is introduced as a masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, as well as video understanding, within one architecture.
  • The model differs from autoregressive and compositional unified approaches by performing omnimodal learning as masked diffusion over a shared discrete token space, with iterative refinement using bidirectional context (a minimal sketch of this refinement loop follows the list).
  • It uses a multi-stage training strategy, including model-merging-based modality expansion and subsequent omnimodal alignment to support broad multimodal capabilities.
  • Across 19 multimodal benchmarks, Dynin-Omni reports strong results across reasoning (e.g., GSM8K), image tasks (e.g., MME-P), video understanding (e.g., VideoMME), and speech recognition (e.g., LibriSpeech WER).
  • The authors argue that masked diffusion provides a flexible unified paradigm for any-to-any modeling, with potential applications in real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.

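The iterative-refinement idea referenced above can be illustrated with a short, self-contained sketch. The toy denoiser, vocabulary size, and cosine unmasking schedule below are illustrative assumptions (MaskGIT-style confidence-based unmasking), not Dynin-Omni's actual architecture or schedule; the point is only that a bidirectional model predicts all masked positions in parallel and commits the most confident ones over a fixed number of steps.

```python
# Minimal sketch of masked-diffusion decoding over a shared discrete token
# space. Model, codebook size, and schedule are hypothetical stand-ins,
# not the paper's actual design.
import math
import torch
import torch.nn as nn

VOCAB_SIZE = 1024          # shared codebook over modalities (assumed size)
MASK_ID = VOCAB_SIZE       # extra id reserved for the [MASK] token
SEQ_LEN = 64
NUM_STEPS = 8              # number of refinement iterations

class TinyBidirectionalDenoiser(nn.Module):
    """Toy bidirectional Transformer that predicts every masked position at once."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model)  # +1 for [MASK]
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # no causal mask: bidirectional
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.head(h)  # logits over the shared token space

@torch.no_grad()
def iterative_refinement(model, steps=NUM_STEPS):
    """Start fully masked, then unmask the most confident positions each step."""
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)
        probs, preds = logits.softmax(-1).max(-1)
        still_masked = tokens == MASK_ID
        # cosine schedule: fraction of positions that stay masked after this step
        keep_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        num_to_unmask = int(still_masked.sum().item()) - int(keep_masked * SEQ_LEN)
        if num_to_unmask <= 0:
            continue
        # pick the most confident currently-masked positions and commit them
        conf = probs.masked_fill(~still_masked, -1.0)
        idx = conf.topk(num_to_unmask, dim=-1).indices
        tokens.scatter_(1, idx, preds.gather(1, idx))
    return tokens

if __name__ == "__main__":
    model = TinyBidirectionalDenoiser().eval()
    print(iterative_refinement(model))
```

Because every position attends to the full bidirectional context at each step, the same loop could in principle fill in masked spans of tokens from any modality represented in the shared codebook.
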
Abstract

We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.
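
As a rough illustration of the model-merging step mentioned in the training strategy, the snippet below averages the parameters of modality-specific fine-tunes that share one backbone into a single checkpoint before further alignment. The uniform weighting and the `merge_checkpoints` helper are hypothetical; the paper's exact merging recipe is not specified here.

```python
# Hedged sketch of model-merging-based modality expansion: weighted averaging
# of checkpoints from modality-specific fine-tunes of a shared backbone.
from typing import Dict, List
import torch

def merge_checkpoints(state_dicts: List[Dict[str, torch.Tensor]],
                      weights: List[float]) -> Dict[str, torch.Tensor]:
    """Weighted average of several state dicts with identical keys and shapes."""
    assert abs(sum(weights) - 1.0) < 1e-6, "merge weights should sum to 1"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Toy usage: two "experts" standing in for modality-specific fine-tunes.
if __name__ == "__main__":
    base = torch.nn.Linear(8, 8)
    text_expert = torch.nn.Linear(8, 8)    # stands in for a text-tuned backbone
    speech_expert = torch.nn.Linear(8, 8)  # stands in for a speech-tuned backbone
    merged_sd = merge_checkpoints(
        [text_expert.state_dict(), speech_expert.state_dict()],
        weights=[0.5, 0.5],
    )
    base.load_state_dict(merged_sd)
```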