MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

arXiv cs.CV / 4/23/2026

Key Points

MMCORE is a unified framework for multimodal image generation and editing that uses a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens.
The predicted embeddings are then used as conditioning signals for a diffusion model, transferring VLM reasoning capabilities into the visual synthesis process.
The approach avoids deep fusion or training from scratch between autoregressive and diffusion models, which reduces computational overhead while preserving high-fidelity image generation.
MMCORE supports text-to-image synthesis as well as interleaved image generation, showing strong performance on tasks requiring spatial reasoning and visual grounding.
Evaluation results report consistent outperformance over state-of-the-art baselines across multiple text-to-image and single/multi-image editing benchmarks.

Abstract

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.
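The data flow described in the abstract (learnable query tokens read out of a VLM, then used as conditioning for a diffusion model) can be illustrated with a toy sketch. This is not the authors' implementation. Every name and size below (`QUERY_TOKENS`, `PROJECTION`, `predict_visual_embeddings`, `denoise_step`, the dimensions) is a hypothetical stand-in, the "VLM" is reduced to a single attention layer with identity projections, and the "diffusion model" is a single cross-attention update. The only point is to show where the query tokens enter and where their outputs are consumed.

```python
import math
import random

random.seed(0)

D_MODEL = 8      # hidden width of the toy "VLM" (assumed)
D_COND = 4       # width of the conditioning embeddings (assumed)
NUM_QUERIES = 3  # number of learnable query tokens (assumed)


def rand_matrix(rows, cols, scale=0.1):
    """Random matrix as a list of rows; stands in for learned weights or inputs."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head attention with identity projections: a stand-in for the VLM."""
    scale = math.sqrt(len(x[0]))
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in x] for q in x]
    return matmul([softmax(r) for r in scores], x)


# Hypothetical learnable parameters: the query tokens and an output projection.
QUERY_TOKENS = rand_matrix(NUM_QUERIES, D_MODEL)
PROJECTION = rand_matrix(D_MODEL, D_COND)


def predict_visual_embeddings(prompt_tokens):
    """Append the query tokens to the prompt and read out their hidden states."""
    sequence = prompt_tokens + QUERY_TOKENS      # queries sit at the end
    hidden = self_attention(sequence)            # toy "VLM" forward pass
    query_states = hidden[-NUM_QUERIES:]         # keep only the query positions
    return matmul(query_states, PROJECTION)      # map to conditioning space


def denoise_step(noisy_latents, conditioning):
    """Toy conditioned update: each latent cross-attends to the embeddings."""
    scale = math.sqrt(len(conditioning[0]))
    updated = []
    for z in noisy_latents:
        weights = softmax(
            [sum(a * b for a, b in zip(z, c)) / scale for c in conditioning]
        )
        context = [
            sum(w * c[j] for w, c in zip(weights, conditioning))
            for j in range(len(z))
        ]
        updated.append([zi + ci for zi, ci in zip(z, context)])  # residual add
    return updated


prompt = rand_matrix(5, D_MODEL, scale=1.0)    # 5 embedded prompt tokens
cond = predict_visual_embeddings(prompt)       # shape: NUM_QUERIES x D_COND
latents = rand_matrix(6, D_COND, scale=1.0)    # 6 noisy latent tokens
updated = denoise_step(latents, cond)          # shape: 6 x D_COND
print(len(cond), len(cond[0]), len(updated), len(updated[0]))
```

In this sketch the two models meet only through the small set of predicted embeddings, which is one way to read the abstract's claim that no deep fusion between the autoregressive and diffusion models is needed.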
