LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

arXiv cs.CV / 4/23/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • LLaDA2.0-Uni is a new unified discrete diffusion LLM (dLLM) designed to perform both multimodal understanding and generation in a single native framework.
  • The model uses a semantic discrete tokenizer (via SigLIP-VQ) plus an MoE-based dLLM backbone to run block-level masked diffusion over discretized text and vision tokens.
  • A diffusion decoder reconstructs visual tokens into high-fidelity images, enabling image generation and editing alongside multimodal reasoning.
  • The authors improve inference efficiency using prefix-aware optimizations in the backbone and few-step distillation in the decoder, while scaling performance with curated large-scale data and a multi-stage training pipeline.
  • The work reports understanding performance comparable to specialized VLMs, supports interleaved generation and reasoning natively, and releases code and models publicly on GitHub.
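The block-level masked diffusion decoding mentioned above can be illustrated with a toy sketch: start from an all-masked sequence, and within each block repeatedly commit the model's most confident predictions until no masks remain. The `toy_denoiser`, block size, and schedule below are illustrative stand-ins, not the paper's actual MoE backbone or sampling schedule.

```python
# Minimal sketch of block-level masked diffusion decoding. `toy_denoiser`
# is a stand-in for the real dLLM: it returns a (token, confidence) pair
# for every position; the decoder commits the most confident masked
# positions in parallel, block by block.
import random

MASK = -1                 # sentinel for a still-masked position
VOCAB = list(range(10))   # toy vocabulary

def toy_denoiser(tokens):
    """Stand-in denoiser: predict a token and a confidence per position."""
    rng = random.Random(sum(t for t in tokens if t != MASK))
    return [(rng.choice(VOCAB), rng.random()) for _ in tokens]

def decode_block(tokens, start, end, steps):
    """Unmask one block over `steps` rounds, most confident positions first."""
    for step in range(steps):
        masked = [i for i in range(start, end) if tokens[i] == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens)
        # Commit roughly an even share of the remaining masks each round.
        k = max(1, len(masked) // (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens

def generate(length, block_size=4, steps=3):
    """Decode blocks left-to-right; earlier blocks act as a fixed prefix."""
    tokens = [MASK] * length
    for start in range(0, length, block_size):
        decode_block(tokens, start, min(start + block_size, length), steps)
    return tokens

print(generate(8))  # all 8 positions unmasked after block-by-block diffusion
```

Because earlier blocks are frozen once decoded, their activations can be cached across later blocks' denoising steps, which is the intuition behind the prefix-aware optimizations the authors describe.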

Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, an MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
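The discretization step the abstract describes (mapping continuous visual features to tokens the dLLM can diffuse over) amounts to vector quantization: each feature vector is replaced by the index of its nearest codebook entry. The codebook size, feature dimension, and squared-Euclidean distance below are illustrative; SigLIP-VQ's actual tokenizer is a learned model, not this toy lookup.

```python
# Hedged sketch of vector quantization, the role SigLIP-VQ plays here:
# continuous "patch" features become discrete token ids via nearest-
# neighbor lookup in a codebook. All values below are toy examples.
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in features]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # 4-entry toy codebook
patches = [(0.1, 0.05), (0.9, 0.95), (0.05, 0.8)]            # toy patch features
print(quantize(patches, codebook))  # → [0, 3, 2]
```

Once images are token sequences like this, the backbone can treat text and vision uniformly under the same masked-diffusion objective, and the diffusion decoder inverts the mapping back to pixels.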