DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning
arXiv cs.CV / 4/10/2026
Key Points
- DiffVC proposes a non-autoregressive video captioning framework that uses a diffusion model to overcome the slow generation speed and cumulative error typical of autoregressive encoder-decoder approaches.
- The method encodes videos into visual representations, injects Gaussian noise into the ground-truth text during training, and uses a discriminative conditional denoiser constrained by the visual features to generate new text representations.
- During inference, DiffVC starts from noise sampled from a Gaussian distribution and iteratively denoises it, conditioned on the visual features, producing all caption positions in parallel instead of decoding token by token.
- Experiments on MSVD, MSR-VTT, and VATEX show gains of up to +9.9 CIDEr and +2.6 BLEU@4 over prior non-autoregressive methods, while approaching autoregressive caption quality at faster inference speed.
- The authors state that the source code will be available soon, which may accelerate adoption and further benchmarking of diffusion-based non-autoregressive captioning.
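The training and inference procedure sketched in the key points can be illustrated with a toy continuous diffusion over caption embeddings. This is a minimal sketch under loose assumptions: the noise schedule, dimensions, and the `denoiser` function are hypothetical placeholders, not the paper's actual architecture or hyperparameters.

```python
# Toy sketch of diffusion-based non-autoregressive captioning (assumptions
# throughout; the denoiser is a dummy stand-in for a trained conditional net).
import numpy as np

rng = np.random.default_rng(0)

T = 50                                  # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, noise):
    """Forward process: inject Gaussian noise into ground-truth text embeddings."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def denoiser(x_t, t, video_feat):
    """Hypothetical conditional denoiser: predicts the injected noise from the
    noisy text embedding and the visual condition (placeholder logic)."""
    return 0.1 * x_t + 0.05 * video_feat

def sample(video_feat, dim):
    """Inference: start from pure Gaussian noise and denoise step by step,
    refining all caption positions in parallel (no token-by-token decoding)."""
    x = rng.standard_normal(dim)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, video_feat)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                       # add noise on all but the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

# Training-side check: the noised embedding keeps the input's shape.
x0 = rng.standard_normal(16)
x_t = q_sample(x0, 10, rng.standard_normal(16))

# Inference-side check: a caption embedding is produced from noise alone.
video_feat = rng.standard_normal(16)
caption_emb = sample(video_feat, 16)
```

The key structural point this illustrates is that the reverse loop iterates over diffusion steps, not over caption tokens, which is what enables the parallel generation the paper claims.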
