BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

arXiv cs.CV / 4/21/2026


Key Points

  • The paper introduces BARD, a framework to convert a pretrained autoregressive vision-language model into a same-architecture, decoding-efficient diffusion VLM (dVLM) without major quality loss.
  • BARD uses progressive supervised block merging to gradually increase decoding block size, plus stage-wise intra-dVLM distillation from a fixed small-block diffusion “anchor” to recover performance degraded by larger blocks.
  • It improves robustness and denoising token revision via a mixed noise scheduler, and enables efficient training on long multimodal sequences with memory-friendly training techniques.
  • The authors find that distilling directly from the autoregressive regime into diffusion is poorly aligned and may hurt results, while distillation within the diffusion regime is consistently effective.
  • Experiments transferring capabilities from Qwen3-VL show strong multimodal performance with ≤4.4M data, achieving new state of the art among comparable-scale open dVLMs (at 4B and 8B) and up to 3× decoding throughput speedup versus the source model.
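The throughput claim in the last point can be made concrete with a simple pass-count model. The sketch below is purely illustrative and not from the paper: the block size (32) and the number of denoising steps per block (8) are assumed values, and real wall-clock speedup also depends on per-pass cost and caching, so it lands below the raw pass-count ratio.

```python
import math

def decoding_passes(seq_len: int, block_size: int = 1, denoise_steps: int = 1) -> int:
    """Forward passes needed to emit `seq_len` tokens.

    Autoregressive decoding is the special case block_size=1,
    denoise_steps=1 (one pass per token). A block-diffusion decoder
    instead refines each block of `block_size` tokens in parallel
    for `denoise_steps` denoising iterations.
    """
    return math.ceil(seq_len / block_size) * denoise_steps

seq_len = 1024
ar_passes = decoding_passes(seq_len)                                    # 1024
dvlm_passes = decoding_passes(seq_len, block_size=32, denoise_steps=8)  # 256
print(ar_passes / dvlm_passes)  # 4.0x fewer passes; wall-clock gain is lower
```

Under these assumed numbers the pass count drops 4×, which is consistent in spirit with the reported up-to-3× decoding throughput gain once per-pass overhead is accounted for.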

Abstract

Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with ≤4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to **3×** decoding throughput speedup compared to the source model.
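The training recipe described above (a block-size curriculum plus distillation from a fixed small-block diffusion anchor, rather than from the autoregressive teacher) can be sketched as a loss schedule. Everything below is a hypothetical illustration: the block sizes in `BLOCK_SCHEDULE`, the KL-based distillation term, and the mixing weight `alpha` are assumptions for exposition, not values or formulas from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p_logits, q_logits):
    """Mean KL(anchor || student) over positions, a common distillation loss."""
    p, q = softmax(p_logits), softmax(q_logits)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

# Hypothetical curriculum: each stage doubles the decoding block size,
# while the distillation teacher stays at the fixed small-block anchor.
BLOCK_SCHEDULE = [4, 8, 16, 32]    # assumed values
ANCHOR_BLOCK = BLOCK_SCHEDULE[0]   # the small-block diffusion anchor

def stage_loss(denoise_loss, anchor_logits, student_logits, alpha=0.5):
    """Denoising objective mixed with intra-dVLM distillation from the anchor."""
    return alpha * denoise_loss + (1 - alpha) * kl(anchor_logits, student_logits)

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 100))  # (positions, vocab), toy numbers
loss = stage_loss(denoise_loss=1.0, anchor_logits=logits, student_logits=logits)
print(loss)  # distillation term vanishes when student matches anchor -> 0.5
```

The key design point mirrored here is that both terms live in the diffusion regime: the teacher is the small-block diffusion anchor, not the original autoregressive model, matching the paper's finding that AR-to-diffusion distillation is poorly aligned.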