Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model

arXiv cs.CV, April 16, 2026


Key Points

  • The paper introduces a new “blind” bitstream-corrupted video recovery setting that avoids needing manually provided corruption masks, making restoration more practical for real-world degradation.
  • It proposes a Metadata-Guided Diffusion Model (M-GDM) that uses intrinsic video metadata (e.g., motion vectors and frame types) via a dual-stream encoder and cross-attention at each diffusion step to identify corrupted regions and guide reconstruction.
  • A prior-driven mask predictor generates pseudo masks from metadata and diffusion priors, enabling separation of intact versus to-be-recovered latent regions through hard masking and recombination.
  • To reduce visible seams and boundary artifacts from imperfect mask estimation, the method adds a post-refinement module that improves consistency between preserved and restored areas.
  • Experiments reportedly show strong performance, surpassing prior blind video recovery approaches; code is released on GitHub.
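The metadata-guidance idea in the second point can be sketched as plain cross-attention: latent tokens of the corrupted frame attend over a fused metadata representation built from separately embedded motion vectors and frame types. This is a minimal, pure-Python illustration; all names, shapes, and the concatenation-based fusion are assumptions for exposition, not the paper's actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Each latent token (query) attends over metadata tokens (keys/values)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def fuse_metadata(mv_emb, ft_emb):
    """Hypothetical dual-stream fusion: concatenate per-token motion-vector
    and frame-type embeddings into one unified representation."""
    return [mv + ft for mv, ft in zip(mv_emb, ft_emb)]

# Toy example: 3 metadata tokens (dim 2 per stream -> dim 4 fused), 2 latent tokens.
mv = [[0.1, 0.2], [0.0, 1.0], [0.5, 0.5]]
ft = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.7]]
meta = fuse_metadata(mv, ft)
latents = [[0.2, 0.1, 0.0, 0.4], [0.9, 0.3, 0.2, 0.1]]
guided = cross_attention(latents, meta, meta)
```

In the actual model this interaction would happen at every diffusion step inside the denoiser; here it is shown once to make the data flow concrete.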

Abstract

Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corrupted regions and recovering content from extensive and irregular degradations. We propose a Metadata-Guided Diffusion Model (M-GDM) to tackle these challenges. Specifically, intrinsic video metadata are leveraged as corruption indicators through a dual-stream metadata encoder that separately embeds motion vectors and frame types before fusing them into a unified representation. This representation interacts with corrupted latent features via cross-attention at each diffusion step. To preserve intact regions, we design a prior-driven mask predictor that generates pseudo masks using both metadata and diffusion priors, enabling the separation and recombination of intact and recovered regions through hard masking. To mitigate boundary artifacts caused by imperfect masks, a post-refinement module enhances consistency between intact and recovered regions. Extensive experiments demonstrate the effectiveness of our method and its superiority in blind video recovery. Code is available at: https://github.com/Shuyun-Wang/M-GDM.
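The hard-masking and post-refinement steps from the abstract can be illustrated on a 1-D toy latent: a binary pseudo mask selects recovered values inside corrupted regions and preserves intact values elsewhere, then a simple smoothing pass reduces seams at mask boundaries. This is a hedged sketch under toy assumptions; the real method operates on latent feature maps, and its refinement module is learned, not the hand-written blend used here.

```python
def recombine(intact, recovered, mask):
    """Hard masking: keep intact latents where mask == 0,
    take recovered latents where mask == 1."""
    return [r if m else i for i, r, m in zip(intact, recovered, mask)]

def refine_boundaries(latent, mask, alpha=0.5):
    """Toy stand-in for post-refinement: blend values adjacent to a mask
    transition with their neighbors to soften visible seams."""
    out = list(latent)
    for t in range(1, len(latent) - 1):
        if mask[t] != mask[t - 1] or mask[t] != mask[t + 1]:
            out[t] = (1 - alpha) * latent[t] + alpha * 0.5 * (latent[t - 1] + latent[t + 1])
    return out

intact    = [1.0, 1.0, 1.0, 1.0, 1.0]   # preserved regions
recovered = [0.9, 0.8, 0.2, 0.3, 0.9]   # diffusion output
mask      = [0, 0, 1, 1, 0]             # pseudo mask: 1 = corrupted
z = recombine(intact, recovered, mask)  # -> [1.0, 1.0, 0.2, 0.3, 1.0]
z_refined = refine_boundaries(z, mask)
```

The hard recombination guarantees that intact regions pass through unchanged; the refinement step only touches positions next to a 0/1 transition, which is exactly where imperfect pseudo masks produce boundary artifacts.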