MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

arXiv cs.CV / 4/23/2026

Key Points

  • MMCORE is a unified framework for multimodal image generation and editing that uses a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens.
  • The predicted embeddings are then used as conditioning signals for a diffusion model, transferring VLM reasoning capabilities into the visual synthesis process.
  • The approach avoids both deep fusion between autoregressive and diffusion models and training from scratch, which reduces computational overhead while preserving high-fidelity image generation.
  • MMCORE supports text-to-image synthesis as well as interleaved image generation, showing strong performance on tasks requiring spatial reasoning and visual grounding.
  • The reported evaluations show MMCORE consistently outperforming state-of-the-art baselines across multiple text-to-image and single/multi-image editing benchmarks.
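
The conditioning path described in the key points can be illustrated with a toy sketch. It is not the paper's implementation: the summary does not specify layer counts, dimensions, or how the diffusion model consumes the embeddings. Everything named here (`attend`, `proj`, the sizes, the random stand-in for frozen VLM hidden states, and the use of cross-attention on the diffusion side) is an assumption chosen only to show the data flow: query tokens read the prompt context, their outputs become semantic embeddings, and those embeddings condition the diffusion latents.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys_values)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, keys_values)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# All sizes are hypothetical and tiny on purpose.
D_VLM, D_COND, NUM_QUERIES, PROMPT_LEN, NUM_LATENTS = 8, 4, 3, 5, 6

# Assumption: random vectors stand in for the frozen VLM's hidden states.
prompt_states = rand_matrix(PROMPT_LEN, D_VLM, rng)
# Learnable query tokens (the trainable part in this sketch).
query_tokens = rand_matrix(NUM_QUERIES, D_VLM, rng)
# Assumption: a linear map into the diffusion model's conditioning width.
proj = rand_matrix(D_VLM, D_COND, rng)

# 1. Query tokens attend over the prompt context plus themselves.
query_out = attend(query_tokens, prompt_states + query_tokens)
# 2. Their outputs are projected into predicted semantic visual embeddings.
cond = matmul(query_out, proj)
# 3. Diffusion latents cross-attend to those embeddings (assumed mechanism).
latents = rand_matrix(NUM_LATENTS, D_COND, rng)
conditioned = attend(latents, cond)

print(len(cond), len(cond[0]))                # conditioning shape
print(len(conditioned), len(conditioned[0]))  # latent shape is unchanged
```

In this toy setup only `query_tokens` and `proj` would be trained while the VLM stays fixed, which is one way to read the claim that the design avoids deep fusion and training from scratch; whether the VLM is actually frozen in MMCORE is not stated in this summary.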

Abstract

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.