VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

arXiv cs.AI / 3/16/2026

💬 Opinion | Ideas & Deep Analysis | Models & Research

Key Points

  • VQQA is a multi-agent framework for video quality evaluation and improvement that generalizes across text-to-video and image-to-video tasks.
  • It replaces traditional, passive evaluation metrics with dynamically generated visual questions; the resulting Vision-Language Model (VLM) critiques act as semantic gradients that guide optimization through a black-box natural language interface.
  • The approach enables a closed-loop prompt optimization process that efficiently isolates and fixes visual artifacts in just a few refinement steps, outperforming stochastic search and prompt optimization baselines.
  • Empirical results show absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2, demonstrating substantial quality gains over vanilla generation.

Abstract

Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
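
To make the closed-loop idea concrete, here is a minimal sketch of a question-driven prompt refinement loop. It is not the authors' implementation: the abstract does not specify the agents' interfaces, so every name here (`Critique`, `refine_loop`, and the `generate`, `ask`, `critique`, and `rewrite` callables) is a hypothetical stand-in, and the toy agents at the bottom replace the video model and the VLM with simple string checks so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    question: str   # visual question posed about the generated video
    score: float    # 0.0 (fails) to 1.0 (passes), assigned by the critic
    feedback: str   # natural-language note on what to fix


def refine_loop(
    prompt: str,
    generate: Callable[[str], str],
    ask: Callable[[str, str], List[str]],
    critique: Callable[[str, List[str]], List[Critique]],
    rewrite: Callable[[str, List[Critique]], str],
    max_steps: int = 3,
    target: float = 0.9,
) -> Tuple[str, float]:
    """Hypothetical closed loop: generate, question, critique, rewrite."""
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(max_steps):
        video = generate(prompt)                  # black-box generator call
        critiques = critique(video, ask(prompt, video))
        if not critiques:                         # nothing to optimize against
            break
        score = sum(c.score for c in critiques) / len(critiques)
        if score > best_score:                    # keep the best prompt seen
            best_prompt, best_score = prompt, score
        if score >= target:
            break
        # Failed questions play the role of the "semantic gradient":
        # their feedback tells the rewriter what to change.
        failures = [c for c in critiques if c.score < target]
        prompt = rewrite(prompt, failures)
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only): the "video" is just
# the prompt text, and a question passes if a key phrase appears in it.
REQUIREMENTS = {
    "Is the cat red?": "red cat",
    "Is the motion smooth?": "smooth motion",
}


def toy_generate(prompt: str) -> str:
    return prompt


def toy_ask(prompt: str, video: str) -> List[str]:
    return list(REQUIREMENTS)


def toy_critique(video: str, questions: List[str]) -> List[Critique]:
    results = []
    for q in questions:
        phrase = REQUIREMENTS[q]
        ok = phrase in video
        results.append(Critique(q, 1.0 if ok else 0.0, "" if ok else f"emphasize {phrase}"))
    return results


def toy_rewrite(prompt: str, failures: List[Critique]) -> str:
    return prompt + ", " + ", ".join(c.feedback for c in failures)


final_prompt, final_score = refine_loop(
    "a cat walking", toy_generate, toy_ask, toy_critique, toy_rewrite
)
```

In a real setting, `generate` would wrap a T2V or I2V model and `ask` and `critique` would wrap VLM calls. The loop itself only exchanges text with the generator, which is what makes the interface black-box: no gradients or model internals are needed.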