High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding
arXiv cs.CV · March 17, 2026
📰 News · Models & Research
Key Points
- The paper proposes a diffusion-based decoding framework that improves image fidelity by training only a diffusion decoder on the image-token logits emitted by a pre-trained vision-language model (VLM), leaving the VLM's weights untouched.
- It introduces Logit-to-Code Distributional Mapping, which converts the VLM's image-token logits into continuous, distribution-weighted code vectors augmented with uncertainty features that guide the diffusion decoding (see the first sketch after this list).
- A lightweight Logit Calibration module aligns the training-time proxy logits produced by the VQ-VAE encoder with the logits the VLM generates at inference, mitigating the train-inference gap (see the second sketch after this list).
- With only brief training on ImageNet-1K, the approach achieves higher fidelity on both VQ-VAE reconstruction and text-to-image generation, again without modifying the original VLM.
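The paper's exact formulation is not reproduced here, but the Logit-to-Code Distributional Mapping can be sketched as a softmax-weighted expectation over the VQ-VAE codebook, with entropy appended as the uncertainty signal. This is a minimal sketch under assumed shapes; the function name `logit_to_code` and the choice of entropy as the uncertainty feature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def logit_to_code(logits: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Distribution-weighted code vectors with an uncertainty feature.

    logits:   (B, N, K) image-token logits from the VLM head
    codebook: (K, D)    VQ-VAE codebook embeddings
    returns:  (B, N, D + 1) soft codes with entropy appended
    """
    probs = F.softmax(logits, dim=-1)                  # per-token distribution over codes
    soft_codes = probs @ codebook                      # expectation over codebook entries
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1, keepdim=True)
    return torch.cat([soft_codes, entropy], dim=-1)    # conditioning input for the decoder
```

Compared with hard argmax decoding of a single code per token, the soft expectation keeps the VLM's full predictive distribution, and the entropy channel tells the diffusion decoder how uncertain each token's prediction is.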
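The Logit Calibration step could be as simple as a learned temperature-and-bias transform applied to the training-time proxy logits before they enter the mapping above. The affine form, the `LogitCalibration` name, and the negative-distance proxy logits in the comment are assumptions for illustration; the paper describes the module only as lightweight.

```python
import torch
import torch.nn as nn

class LogitCalibration(nn.Module):
    """Learned temperature + per-code bias over proxy logits (illustrative form)."""

    def __init__(self, codebook_size: int):
        super().__init__()
        self.log_temp = nn.Parameter(torch.zeros(1))          # global temperature
        self.bias = nn.Parameter(torch.zeros(codebook_size))  # per-code bias

    def forward(self, proxy_logits: torch.Tensor) -> torch.Tensor:
        # proxy_logits: (B, N, K), e.g. negative squared distances between
        # the VQ-VAE encoder output and each codebook entry
        return proxy_logits * self.log_temp.exp() + self.bias
```

Training the decoder on calibrated proxy logits means it sees inputs shaped like the VLM's logits at inference time, which is the train-inference gap the third key point refers to.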