Nucleus-Image: Sparse MoE for Image Generation
arXiv cs.CV / 4/15/2026
Key Points
- Nucleus-Image is a text-to-image diffusion transformer that uses a sparse mixture-of-experts (MoE) backbone with expert-choice routing to match or exceed leading-model quality while activating only ~2B parameters per forward pass (see the routing sketch below).
- The model scales to 17B total parameters across 64 routed experts per layer and improves inference efficiency by excluding text tokens from the transformer backbone and reusing text keys/values (KV) across denoising timesteps (KV-cache sketch below).
- To stabilize routing under timestep modulation, it introduces a decoupled routing scheme that separates timestep-aware expert assignment from timestep-conditioned expert computation (decoupled-router sketch below).
- Training used 1.5B high-quality text-image pairs (700M unique images) with multi-stage filtering, deduplication, and aesthetic tiering, plus a progressive resolution curriculum up to 1024×1024 and progressive expert-capacity sparsification (schedule sketch below).
- The authors report strong benchmark performance without post-training methods such as reinforcement learning or preference optimization, and they release an open-source training recipe, positioning Nucleus-Image as a first-of-its-kind open MoE diffusion model at this quality level.
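
The digest doesn't include code, but expert-choice routing is easy to sketch: instead of each token picking its top experts, each expert picks the tokens it scores highest, up to a fixed capacity, which balances expert load by construction. A minimal PyTorch sketch under assumed shapes; all names (`ExpertChoiceMoE`, `capacity`, etc.) are illustrative, not from the paper:

```python
import torch
import torch.nn as nn


class ExpertChoiceMoE(nn.Module):
    """Sparse MoE layer with expert-choice routing (illustrative sketch).

    Each expert selects its top-`capacity` tokens, so compute per expert is
    balanced by construction; tokens no expert picks pass through unchanged
    via the surrounding residual connection.
    """

    def __init__(self, dim: int, num_experts: int, capacity: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim); flatten batch and sequence dims before calling.
        scores = self.gate(x).softmax(dim=0)          # normalize over tokens
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Expert e picks the `capacity` tokens it scores highest.
            w, idx = scores[:, e].topk(self.capacity)
            out.index_add_(0, idx, w.unsqueeze(-1) * expert(x[idx]))
        return out


moe = ExpertChoiceMoE(dim=128, num_experts=8, capacity=64)
print(moe(torch.randn(256, 128)).shape)               # torch.Size([256, 128])
```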
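Reusing text KV across timesteps follows from a simple observation: the prompt embedding is constant over the whole denoising trajectory, so its cross-attention key/value projections can be computed once per prompt and cached. A hedged sketch of that pattern (class and method names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CachedTextCrossAttention(nn.Module):
    """Cross-attention that projects text K/V once per prompt, not per step."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)

    def cache_text(self, text_emb: torch.Tensor):
        # Run once per prompt; the result is valid at every timestep.
        return self.kv(text_emb).chunk(2, dim=-1)

    def forward(self, img_tokens: torch.Tensor, cached_kv) -> torch.Tensor:
        k, v = cached_kv
        b, n, d = img_tokens.shape
        h = self.num_heads
        q = self.q(img_tokens).view(b, n, h, d // h).transpose(1, 2)
        k = k.view(b, -1, h, d // h).transpose(1, 2)
        v = v.view(b, -1, h, d // h).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out(attn.transpose(1, 2).reshape(b, n, d))


attn = CachedTextCrossAttention(dim=128)
kv = attn.cache_text(torch.randn(1, 77, 128))  # text K/V computed once
x = torch.randn(1, 256, 128)
for step in range(50):                         # denoising loop reuses the cache
    x = x + attn(x, kv)
```

Keeping text tokens out of the backbone's self-attention means this cached cross-attention is the only place the prompt is consulted, which is where the per-step savings come from.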
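One plausible reading of the decoupled routing point: expert *assignment* receives the timestep through a separate, stable additive embedding, while expert *computation* consumes the usual AdaLN-style scale/shift-modulated tokens, so routing scores are insulated from per-step modulation noise. The sketch below is an assumption about the mechanism, not the paper's implementation:

```python
import torch
import torch.nn as nn


class DecoupledRoutedMoE(nn.Module):
    """Decoupled routing sketch: assignment and computation see the timestep
    through different pathways. Hypothetical reading of the digest summary.
    """

    def __init__(self, dim: int, num_experts: int, capacity: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.t_embed = nn.Linear(dim, dim)         # timestep path for assignment
        self.modulation = nn.Linear(dim, 2 * dim)  # scale/shift for computation
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.capacity = capacity

    def forward(self, x: torch.Tensor, t_cond: torch.Tensor) -> torch.Tensor:
        # Assignment: raw tokens plus a timestep embedding -- no per-step
        # scale/shift, so routing scores drift less across timesteps.
        scores = self.router(x + self.t_embed(t_cond)).softmax(dim=0)
        # Computation: experts consume the timestep-modulated tokens.
        scale, shift = self.modulation(t_cond).chunk(2, dim=-1)
        h = x * (1 + scale) + shift
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            w, idx = scores[:, e].topk(self.capacity)   # expert-choice pick
            out.index_add_(0, idx, w.unsqueeze(-1) * expert(h[idx]))
        return out


moe = DecoupledRoutedMoE(dim=128, num_experts=8, capacity=64)
x, t = torch.randn(256, 128), torch.randn(128)   # tokens, timestep embedding
print(moe(x, t).shape)                           # torch.Size([256, 128])
```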
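The staged recipe (rising resolution, shrinking expert capacity) amounts to a training schedule. The sketch below shows the shape of such a schedule; every number in it is an invented placeholder, not a value from the paper:

```python
# Hypothetical schedule: resolution rises while the expert-capacity factor
# shrinks, so the model ends sparser than it starts. All numbers are
# illustrative placeholders.
STAGES = [
    {"resolution": 256,  "capacity_factor": 2.0, "steps": 200_000},
    {"resolution": 512,  "capacity_factor": 1.0, "steps": 100_000},
    {"resolution": 1024, "capacity_factor": 0.5, "steps": 50_000},
]


def capacity(num_tokens: int, num_experts: int, factor: float) -> int:
    # Tokens each expert may select under expert-choice routing.
    return max(1, int(factor * num_tokens / num_experts))


for stage in STAGES:
    n_tokens = (stage["resolution"] // 16) ** 2   # e.g. patch size 16
    cap = capacity(n_tokens, num_experts=64, factor=stage["capacity_factor"])
    print(f"res {stage['resolution']}: {cap} tokens per expert")
```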