LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

arXiv cs.CV / 4/23/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • LLaDA2.0-Uni is a new unified discrete diffusion LLM (dLLM) designed to perform both multimodal understanding and generation in a single native framework.
  • The model uses a semantic discrete tokenizer (via SigLIP-VQ) plus an MoE-based dLLM backbone to run block-level masked diffusion over discretized text and vision tokens.
  • A diffusion decoder reconstructs visual tokens into high-fidelity images, enabling image generation and editing alongside multimodal reasoning.
  • The authors improve inference efficiency using prefix-aware optimizations in the backbone and few-step distillation in the decoder, while scaling performance with curated large-scale data and a multi-stage training pipeline.
  • The work reports understanding performance comparable to specialized VLMs, supports interleaved generation and reasoning natively, and releases code and models publicly on GitHub.
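The block-level masked diffusion decoding mentioned above can be illustrated with a toy sketch: start from an all-masked sequence, and within each block repeatedly commit the model's most confident predictions until no masks remain. The `toy_denoiser`, block size, and schedule below are illustrative stand-ins, not the paper's actual MoE backbone or sampling schedule.

```python
# Minimal sketch of block-level masked diffusion decoding. `toy_denoiser`
# is a stand-in for the real dLLM: it returns a (token, confidence) pair
# for every position; the decoder commits the most confident masked
# positions in parallel, block by block.
import random

MASK = -1                 # sentinel for a still-masked position
VOCAB = list(range(10))   # toy vocabulary

def toy_denoiser(tokens):
    """Stand-in denoiser: predict a token and a confidence per position."""
    rng = random.Random(sum(t for t in tokens if t != MASK))
    return [(rng.choice(VOCAB), rng.random()) for _ in tokens]

def decode_block(tokens, start, end, steps):
    """Unmask one block over `steps` rounds, most confident positions first."""
    for step in range(steps):
        masked = [i for i in range(start, end) if tokens[i] == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens)
        # Commit roughly an even share of the remaining masks each round.
        k = max(1, len(masked) // (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens

def generate(length, block_size=4, steps=3):
    """Decode blocks left-to-right; earlier blocks act as a fixed prefix."""
    tokens = [MASK] * length
    for start in range(0, length, block_size):
        decode_block(tokens, start, min(start + block_size, length), steps)
    return tokens

print(generate(8))  # all 8 positions unmasked after block-by-block diffusion
```

Because earlier blocks are frozen once decoded, their activations can be cached across later blocks' denoising steps, which is the intuition behind the prefix-aware optimizations the authors describe.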

Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, an MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
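The discretization step the abstract describes (mapping continuous visual features to tokens the dLLM can diffuse over) amounts to vector quantization: each feature vector is replaced by the index of its nearest codebook entry. The codebook size, feature dimension, and squared-Euclidean distance below are illustrative; SigLIP-VQ's actual tokenizer is a learned model, not this toy lookup.

```python
# Hedged sketch of vector quantization, the role SigLIP-VQ plays here:
# continuous "patch" features become discrete token ids via nearest-
# neighbor lookup in a codebook. All values below are toy examples.
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in features]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # 4-entry toy codebook
patches = [(0.1, 0.05), (0.9, 0.95), (0.05, 0.8)]            # toy patch features
print(quantize(patches, codebook))  # → [0, 3, 2]
```

Once images are token sequences like this, the backbone can treat text and vision uniformly under the same masked-diffusion objective, and the diffusion decoder inverts the mapping back to pixels.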