VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought

arXiv cs.CV / 3/25/2026


Key Points

  • The paper introduces VQ-Jarvis, a retrieval-augmented video restoration agent aimed at handling heterogeneous real-world degradations better than fixed pipelines.
  • It proposes “sharp vision” via VSR-Compare, a large-scale paired video enhancement dataset (20K comparison pairs) spanning 7 degradation types and 11 enhancement operators.
  • VQ-Jarvis uses trained judge and degradation-perception models to distinguish subtle quality differences among candidate restorations and to guide agent decisions.
  • For “fast thought,” it combines one-step retrieval for easier videos with hierarchical step-by-step greedy search for more difficult cases to balance efficiency and accuracy.
  • Experiments reported in the paper indicate that VQ-Jarvis outperforms existing video restoration approaches on complex degraded videos.
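The judge-guided decision in the third point can be illustrated with a minimal sketch. The `judge` function below is a stand-in stub; in the paper it would be the trained multi-operator judge model that compares paired restoration results, and all names and fields here are illustrative assumptions.

```python
# Hypothetical sketch: using a pairwise judge to pick the best of several
# candidate restorations produced by different enhancement operators.

def judge(candidate_a, candidate_b):
    """Return True if candidate_a is judged higher quality than candidate_b.
    Stub: compares a placeholder 'quality' field; the real judge model would
    compare the two restored videos directly."""
    return candidate_a["quality"] > candidate_b["quality"]

def select_best(candidates):
    """Tournament-style selection: keep the winner of successive pairwise
    comparisons, as an agent might after trying several operators."""
    best = candidates[0]
    for cand in candidates[1:]:
        if judge(cand, best):
            best = cand
    return best

candidates = [
    {"op": "denoise", "quality": 0.71},
    {"op": "deblur",  "quality": 0.83},
    {"op": "sr",      "quality": 0.78},
]
print(select_best(candidates)["op"])  # deblur
```

Pairwise comparison sidesteps the need for an absolute quality scale, which is what makes a judge trained on comparison pairs (as in VSR-Compare) a natural fit for ranking candidates.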

Abstract

Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multi-operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.
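The hierarchical scheduling described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: the RAG library, operator set, difficulty labels, and scoring stub are all hypothetical, and the real system would use the trained judge and degradation-perception models rather than the placeholder `score`.

```python
# Hypothetical sketch of difficulty-adaptive operator scheduling:
# easy videos get a one-step trajectory lookup from a RAG-style library;
# hard videos fall back to step-by-step greedy search.

RAG_LIBRARY = {
    # degradation signature -> known-good operator trajectory
    ("noise",): ["denoise"],
    ("blur", "noise"): ["denoise", "deblur"],
}

OPERATORS = ["denoise", "deblur", "super_resolve"]

def score(state):
    """Stub quality score; stands in for the judge/perception models.
    Here: fewer remaining degradations = better."""
    return -len(state["remaining"])

def apply_op(state, op):
    """Apply an operator: remove the degradation it targets, if present."""
    target = {"denoise": "noise", "deblur": "blur", "super_resolve": "low_res"}[op]
    return {"remaining": [d for d in state["remaining"] if d != target]}

def greedy_search(degradations, max_steps=3):
    """Step-by-step greedy search: at each step pick the operator whose
    result the scoring model ranks highest."""
    state = {"remaining": list(degradations)}
    trajectory = []
    for _ in range(max_steps):
        if not state["remaining"]:
            break
        best_op = max(OPERATORS, key=lambda op: score(apply_op(state, op)))
        state = apply_op(state, best_op)
        trajectory.append(best_op)
    return trajectory

def schedule(degradations, difficulty):
    sig = tuple(sorted(degradations))
    if difficulty == "easy" and sig in RAG_LIBRARY:
        return RAG_LIBRARY[sig]          # one-step retrieval
    return greedy_search(degradations)   # harder cases: greedy search

print(schedule(["noise"], "easy"))                    # ['denoise']
print(schedule(["blur", "noise", "low_res"], "hard"))
```

The design trade-off mirrors the paper's "fast thought": retrieval amortizes planning cost over previously solved cases, while greedy search pays a per-step evaluation cost only when the degradation mix is unfamiliar or severe.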