AI Navigate

Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

arXiv cs.CV / 3/18/2026


Key Points

  • Unified Multimodal Models (UMMs) struggle because their visual generation components rely on inefficient training paradigms and scarce high-quality text-image data.
  • The paper proposes Image-Only Training for UMMs (IOMM), a two-stage framework where the visual generator is pre-trained exclusively on unlabeled image data, followed by fine-tuning with unlabeled images plus a small set of text-image pairs to improve instruction alignment and generation quality.
  • IOMM-B (3.6B), trained from scratch in about 1,050 H800 GPU hours (the vast majority spent on image-only pre-training), achieves 0.89 on GenEval and 0.55 on WISE, surpassing BAGEL-7B and BLIP3-o-4B.
  • Code for IOMM is available at the project repository.
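The title's "masked modeling" idea in the first, image-only stage can be illustrated with a generic masked-image-modeling objective: hide a random subset of patches and score reconstruction only on the hidden ones, so no text labels are needed. This is a minimal sketch of that general technique, not IOMM's actual recipe; the function name, mask ratio, and zero-masking scheme are illustrative assumptions.

```python
import numpy as np

def masked_modeling_loss(image_patches, predict_fn, mask_ratio=0.75, rng=None):
    """Generic masked-image-modeling objective (illustrative, not IOMM's exact loss):
    zero out a random subset of patches, ask the model to reconstruct the full grid,
    and compute MSE only on the masked positions. Requires no text annotations."""
    rng = rng or np.random.default_rng(0)
    n_patches = image_patches.shape[0]
    n_masked = int(mask_ratio * n_patches)
    masked_idx = rng.choice(n_patches, size=n_masked, replace=False)

    visible = image_patches.copy()
    visible[masked_idx] = 0.0          # hide the selected patches
    pred = predict_fn(visible)         # model predicts the full patch grid

    diff = pred[masked_idx] - image_patches[masked_idx]
    return float(np.mean(diff ** 2))   # loss over masked patches only

# Toy usage: 16 patches of dimension 8; a trivial "model" that echoes its input
patches = np.random.default_rng(1).normal(size=(16, 8))
loss = masked_modeling_loss(patches, predict_fn=lambda v: v)
```

Because the echo "model" returns zeros at masked positions, the loss is strictly positive; a perfect oracle that returns the original patches drives it to zero, which is the signal the pre-training stage optimizes.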

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their **visual generation components**, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for **UMM visual generation** and identify these two issues as the major bottlenecks. To address them, we propose **Image-Only Training for UMMs (IOMM)**, a data-efficient two-stage training framework. The first stage pre-trains the visual generative component **exclusively** using abundant unlabeled image-only data, thereby removing the dependency on paired data **for this costly phase**. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~**1050** H800 GPU hours (with the vast majority, **1000** hours, dedicated to the efficient **image-only pre-training stage**). It achieves **0.89** on GenEval and **0.55** on WISE, surpassing strong baselines such as BAGEL-7B (0.82 / 0.55) and BLIP3-o-4B (0.84 / 0.50). Code is available at https://github.com/LINs-lab/IOMM.
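The second stage mixes abundant unlabeled images with a small curated set of text-image pairs. One simple way to realize such a mixture is per-batch sampling with a fixed pair fraction; the sketch below assumes that scheme (the function name and the 25% ratio are hypothetical — the abstract does not specify the actual mixing ratio).

```python
import random

def mixed_batch(unlabeled_images, text_image_pairs,
                batch_size=8, pair_fraction=0.25, rng=None):
    """Illustrative stage-2 data mixing: draw a small fraction of each batch
    from the curated text-image pairs and the remainder from the unlabeled
    image pool, then shuffle. The 0.25 pair fraction is an assumption."""
    rng = rng or random.Random(0)
    n_pairs = int(round(pair_fraction * batch_size))
    batch = [("pair", rng.choice(text_image_pairs)) for _ in range(n_pairs)]
    batch += [("image", rng.choice(unlabeled_images))
              for _ in range(batch_size - n_pairs)]
    rng.shuffle(batch)  # interleave the two data sources
    return batch

# Toy usage: 100 unlabeled images, 5 curated pairs, batches of 8
batch = mixed_batch(list(range(100)), list(range(5)))
```

With a batch size of 8 and a 25% pair fraction, each batch carries 2 curated pairs and 6 unlabeled images, keeping the expensive paired data a small supplement rather than the main training signal.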