High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding
arXiv cs.CV · March 17, 2026
📰 News · Models & Research
Key Points
- The paper proposes a diffusion-based decoding framework that improves image fidelity by training only a diffusion decoder on the image-token logits emitted by a pre-trained vision-language model (VLM), leaving the VLM's weights untouched.
- It introduces Logit-to-Code Distributional Mapping, which converts the VLM's image-token logits into continuous, distribution-weighted code vectors augmented with uncertainty features that guide the diffusion decoding (see the first sketch after this list).
- A lightweight Logit Calibration module aligns the training-time proxy logits produced by the VQ-VAE encoder with the logits the VLM generates at inference, mitigating the train-inference gap (see the second sketch after this list).
- With only brief training on ImageNet-1K, the approach achieves higher fidelity on both VQ-VAE reconstruction and text-to-image generation, again without modifying the original VLM.
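The paper's exact formulation is not reproduced here, but the Logit-to-Code Distributional Mapping can be sketched as a softmax-weighted expectation over the VQ-VAE codebook, with entropy appended as the uncertainty signal. This is a minimal sketch under assumed shapes; the function name `logit_to_code` and the choice of entropy as the uncertainty feature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def logit_to_code(logits: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Distribution-weighted code vectors with an uncertainty feature.

    logits:   (B, N, K) image-token logits from the VLM head
    codebook: (K, D)    VQ-VAE codebook embeddings
    returns:  (B, N, D + 1) soft codes with entropy appended
    """
    probs = F.softmax(logits, dim=-1)                  # per-token distribution over codes
    soft_codes = probs @ codebook                      # expectation over codebook entries
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1, keepdim=True)
    return torch.cat([soft_codes, entropy], dim=-1)    # conditioning input for the decoder
```

Compared with hard argmax decoding of a single code per token, the soft expectation keeps the VLM's full predictive distribution, and the entropy channel tells the diffusion decoder how uncertain each token's prediction is.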
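The Logit Calibration step could be as simple as a learned temperature-and-bias transform applied to the training-time proxy logits before they enter the mapping above. The affine form, the `LogitCalibration` name, and the negative-distance proxy logits in the comment are assumptions for illustration; the paper describes the module only as lightweight.

```python
import torch
import torch.nn as nn

class LogitCalibration(nn.Module):
    """Learned temperature + per-code bias over proxy logits (illustrative form)."""

    def __init__(self, codebook_size: int):
        super().__init__()
        self.log_temp = nn.Parameter(torch.zeros(1))          # global temperature
        self.bias = nn.Parameter(torch.zeros(codebook_size))  # per-code bias

    def forward(self, proxy_logits: torch.Tensor) -> torch.Tensor:
        # proxy_logits: (B, N, K), e.g. negative squared distances between
        # the VQ-VAE encoder output and each codebook entry
        return proxy_logits * self.log_temp.exp() + self.bias
```

Training the decoder on calibrated proxy logits means it sees inputs shaped like the VLM's logits at inference time, which is the train-inference gap the third key point refers to.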