BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

arXiv cs.CV / 4/21/2026

Key Points

The paper introduces BARD, a framework to convert a pretrained autoregressive vision-language model into a same-architecture, decoding-efficient diffusion VLM (dVLM) without major quality loss.
BARD uses progressive supervised block merging to gradually increase decoding block size, plus stage-wise intra-dVLM distillation from a fixed small-block diffusion “anchor” to recover performance degraded by larger blocks.
A mixed noise scheduler improves robustness and token revision during denoising, and memory-friendly training techniques make training on long multimodal sequences efficient.
The authors find that distilling directly from the autoregressive regime into diffusion is poorly aligned and may hurt results, while distillation within the diffusion regime is consistently effective.
Experiments transferring capabilities from Qwen3-VL show strong multimodal performance with ≤4.4M data, achieving new state of the art among comparable-scale open dVLMs (at 4B and 8B) and up to 3× decoding throughput speedup versus the source model.
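The idea of a decoding block size that grows in stages can be illustrated with a small sketch. The summary does not give BARD's attention pattern or its block-size schedule, so everything here is an assumption made for illustration: the mask follows the block-wise pattern commonly used in block diffusion language models (full attention inside a block, causal attention across blocks), and the doubling schedule and the function names are hypothetical.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed pattern (not taken from the paper): tokens see every token in
    their own block and in all earlier blocks, but nothing in later blocks.
    """
    if seq_len < 0 or block_size < 1:
        raise ValueError("seq_len must be >= 0 and block_size must be >= 1")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def doubling_schedule(start, target):
    """Hypothetical stage schedule: double the block size until target."""
    if start < 1 or target < start:
        raise ValueError("need 1 <= start <= target")
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(min(sizes[-1] * 2, target))
    return sizes


def blocks_needed(seq_len, block_size):
    """Number of blocks that must be decoded one after another."""
    return -(-seq_len // block_size)  # ceiling division


if __name__ == "__main__":
    for size in doubling_schedule(4, 32):
        print(size, blocks_needed(256, size))
```

With block_size=1 the mask reduces to the ordinary causal mask of an autoregressive model, which is one way to see why small blocks stay close to the source model while large blocks move further from it. The block count only bounds the sequential steps across blocks; real throughput also depends on how many denoising steps each block takes, which this sketch does not model.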

Abstract

Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with ≤4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3× decoding throughput speedup compared to the source model.
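Distillation from a fixed small-block anchor into a larger-block student can be pictured as matching the two models' token distributions. The abstract does not state the loss, so the sketch below is an assumption: it uses a mean KL divergence from teacher to student, computed only over masked (still noised) positions, and the function names are hypothetical. Logits are plain Python lists so the example stays self-contained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def masked_kl_distillation(teacher_logits, student_logits, masked):
    """Mean KL(teacher || student) over positions flagged in `masked`.

    teacher_logits: per-position logits from the frozen small-block anchor.
    student_logits: per-position logits from the larger-block student.
    masked: per-position booleans; only True positions contribute.
    Assumed loss form, not confirmed by the source.
    """
    if not (len(teacher_logits) == len(student_logits) == len(masked)):
        raise ValueError("all inputs must cover the same positions")
    total, count = 0.0, 0
    for t_row, s_row, is_masked in zip(teacher_logits, student_logits, masked):
        if not is_masked:
            continue
        t_log = log_softmax(t_row)
        s_log = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_log, s_log))
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    student = [[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]]
    print(masked_kl_distillation(teacher, student, [True, False]))
```

In a stage-wise setup of the kind the abstract describes, the teacher logits would come from the same fixed anchor at every stage while the student is trained at the current, larger block size. The loss is zero exactly when the student reproduces the anchor's distribution on the masked positions.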
