High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding
arXiv cs.CV / 3/17/2026
📰 News · Models & Research
Key Points
- The paper proposes a diffusion-based decoding framework that improves image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained vision-language models (VLMs), leaving the VLM itself frozen.
- It introduces Logit-to-Code Distributional Mapping, which converts the VLM's image-token logits into continuous, distribution-weighted code vectors, augmented with uncertainty features, that guide the diffusion decoder (see the first sketch after this list).
- A lightweight Logit Calibration module aligns training-time proxy logits derived from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap (see the second sketch after this list).
- The approach achieves higher fidelity on both VQ-VAE reconstruction and text-to-image generation after only brief training on ImageNet-1K, all without modifying the original VLM.
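
To make the second key point concrete, here is a minimal PyTorch sketch of one plausible reading of Logit-to-Code Distributional Mapping: softmax the logits over the VQ-VAE codebook, take the probability-weighted sum of codebook embeddings instead of a hard argmax lookup, and attach per-position entropy as the uncertainty feature. The function name, the `temperature` parameter, and the choice of entropy as the uncertainty signal are illustrative assumptions, not details confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def logit_to_code_mapping(logits: torch.Tensor,
                          codebook: torch.Tensor,
                          temperature: float = 1.0):
    """Map per-position image-token logits to continuous code vectors.

    logits:   (B, N, K) VLM logits over a K-entry VQ-VAE codebook
    codebook: (K, D) codebook embedding vectors
    Returns a (B, N, D) distribution-weighted code vector and a
    (B, N, 1) entropy feature serving as the uncertainty signal.
    """
    probs = F.softmax(logits / temperature, dim=-1)             # (B, N, K)
    # Expected codebook embedding under the token distribution,
    # rather than a hard argmax lookup.
    soft_codes = probs @ codebook                               # (B, N, D)
    # Per-position entropy as a simple uncertainty feature.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1, keepdim=True)
    return soft_codes, entropy
```

Under this reading, the soft codes degrade gracefully: where the VLM is confident, the result approaches the argmax codebook entry; where it is uncertain, the diffusion decoder receives a blended embedding plus a high-entropy flag it can learn to resolve.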
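
The calibration step in the third key point addresses the fact that, during decoder training, the VLM never produces logits for ground-truth images; only the VQ-VAE encoder is available. A hedged sketch of one way to build and align such proxy logits: treat negative squared distances to the codebook as proxy logits, then apply a learned affine calibration (a scalar temperature plus per-entry bias). Both the proxy definition and the affine form are assumptions for illustration; the abstract does not specify the alignment objective.

```python
import torch
import torch.nn as nn

class LogitCalibration(nn.Module):
    """Lightweight calibration from VQ-VAE proxy logits to VLM-like logits.

    Proxy logits here are negative squared distances between encoder
    features and codebook entries; the affine form (scale + bias) is an
    illustrative assumption, not the paper's exact parameterization.
    """

    def __init__(self, codebook_size: int):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(1))        # learned temperature
        self.bias = nn.Parameter(torch.zeros(codebook_size))  # per-entry offset

    def proxy_logits(self, z_e: torch.Tensor, codebook: torch.Tensor):
        # z_e: (B, N, D) encoder features; codebook: (K, D)
        cb = codebook.unsqueeze(0).expand(z_e.size(0), -1, -1)  # (B, K, D)
        d2 = torch.cdist(z_e, cb) ** 2                          # (B, N, K)
        return -d2

    def forward(self, proxy: torch.Tensor):
        return proxy * self.log_scale.exp() + self.bias
```

In this reading, the calibrated proxy logits stand in for VLM logits while training the diffusion decoder, so that at inference the decoder sees inputs statistically similar to what it was trained on.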
Related Articles

Math needs thinking time, everyday knowledge needs memory, and a new Transformer architecture aims to deliver both
THE DECODER
Kreuzberg v4.5.0: We loved Docling's model so much that we gave it a faster engine
Reddit r/LocalLLaMA
Today, what hardware to get for running large-ish local models like qwen 120b?
Reddit r/LocalLLaMA
Running mistral locally for meeting notes and it's honestly good enough for my use case
Reddit r/LocalLLaMA
[D] Single-artist longitudinal fine art dataset spanning 5 decades now on Hugging Face — potential applications in style evolution, figure representation, and ethical training data
Reddit r/MachineLearning