Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

arXiv cs.CV / 3/18/2026

Key Points

Unified Multimodal Models (UMMs) struggle because their visual generation components rely on inefficient training paradigms and scarce high-quality text-image data.
The paper proposes Image-Only Training for UMMs (IOMM), a two-stage framework where the visual generator is pre-trained exclusively on unlabeled image data, followed by fine-tuning with unlabeled images plus a small set of text-image pairs to improve instruction alignment and generation quality.
IOMM-B (3.6B), trained from scratch in about 1050 H800 GPU hours (roughly 1000 of them in image-only pre-training), scores 0.89 on GenEval and 0.55 on WISE, compared with 0.82 and 0.55 for BAGEL-7B and 0.84 and 0.50 for BLIP3-o-4B.
Code for IOMM is available at https://github.com/LINs-lab/IOMM.
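
The title refers to masked modeling, but this summary does not describe the paper's actual masking scheme. As a rough illustration of the general idea only, the sketch below hides a random subset of image patches so that a model could be trained to reconstruct them from the visible ones, which needs no captions. The function names, the patch count, and the 0.75 mask ratio are all assumptions made for this example, not details from IOMM.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return a list of booleans, True where a patch is hidden from the model.

    Generic masked-image-modeling illustration; NOT the IOMM masking scheme.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def split_patches(patches, mask):
    """Split patches into (visible inputs, hidden reconstruction targets)."""
    visible = [p for p, m in zip(patches, mask) if not m]
    targets = [p for p, m in zip(patches, mask) if m]
    return visible, targets


# Hypothetical example: a 4x4 grid of patches, 75% of them masked.
patches = list(range(16))  # stand-ins for real patch embeddings
mask = random_patch_mask(len(patches), mask_ratio=0.75, seed=0)
visible, targets = split_patches(patches, mask)
print(len(visible), "visible patches,", len(targets), "to reconstruct")
```

Because the training signal comes from the image itself, an objective of this kind can use unlabeled images, which is the property the first IOMM stage relies on.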

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only approximately 1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 and 0.55) and BLIP3-o-4B (0.84 and 0.50). Code is available at https://github.com/LINs-lab/IOMM.
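
The two-stage data recipe described in the abstract can be pictured as a batch sampler whose share of text-image pairs is zero in the first stage and small but non-zero in the second. The sketch below is only an illustration under that reading: the function name, the `paired_fraction` parameter, the 0.25 value, and the use of `None` as the caption for unlabeled images are assumptions, not details taken from the paper or its repository.

```python
import random


def sample_mixed_batch(unlabeled_images, paired_examples, batch_size,
                       paired_fraction, seed=None):
    """Draw one training batch mixing unlabeled images and text-image pairs.

    paired_fraction is a hypothetical knob: 0.0 mimics image-only
    pre-training, a small positive value mimics the fine-tuning mixture.
    """
    if not 0.0 <= paired_fraction <= 1.0:
        raise ValueError("paired_fraction must be in [0, 1]")
    rng = random.Random(seed)
    num_paired = int(batch_size * paired_fraction)
    batch = []
    for _ in range(num_paired):
        image, text = rng.choice(paired_examples)
        batch.append({"image": image, "text": text})
    for _ in range(batch_size - num_paired):
        # Unlabeled images carry no caption (assumed representation).
        batch.append({"image": rng.choice(unlabeled_images), "text": None})
    rng.shuffle(batch)
    return batch


# Placeholder data standing in for real image tensors and captions.
unlabeled = ["img_u1", "img_u2", "img_u3"]
paired = [("img_p1", "a red bicycle"), ("img_p2", "two cats on a sofa")]

stage_one = sample_mixed_batch(unlabeled, paired, 8, paired_fraction=0.0, seed=0)
stage_two = sample_mixed_batch(unlabeled, paired, 8, paired_fraction=0.25, seed=0)
print(sum(ex["text"] is not None for ex in stage_one), "captioned in stage one")
print(sum(ex["text"] is not None for ex in stage_two), "captioned in stage two")
```

On the reported budget, about 1000 of the roughly 1050 H800 GPU hours, or about 95%, go to the first stage, so a sampler of this kind would run with no paired data for most of training.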
