DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

arXiv cs.CV · April 15, 2026


Key Points

  • The paper introduces DPC-VQA, arguing that pretrained multimodal LLMs provide a strong perceptual prior for video quality assessment while the key issue is efficiently calibrating outputs to a target MOS space.
  • DPC-VQA freezes the base MLLM for quality estimation and adds a lightweight calibration branch that predicts a residual correction, avoiding expensive end-to-end retraining.
  • Experiments on UGC and AIGC video quality assessment benchmarks show competitive results versus baseline methods while training with under 2% of the trainable parameters typical of conventional MLLM-based approaches.
  • The approach remains effective with only 20% of the MOS labels, reducing the annotation burden for adapting to new scenarios.
  • The authors state that code will be released upon publication.

Abstract

Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupled perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
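To make the decoupling concrete, here is a minimal sketch of the idea described above: a frozen scorer supplies the base quality estimate, and only a tiny calibration head is fit on a small labelled subset to predict a residual that maps base scores into the target MOS space. All names here are illustrative assumptions (the paper's code is not yet released); the "frozen prior" is a stand-in for the MLLM, and the calibration head is reduced to a one-variable linear fit for clarity.

```python
# Hypothetical sketch of DPC-VQA's decoupled design. Illustrative only:
# the frozen scorer stands in for the pretrained MLLM, and the residual
# calibration head is simplified to a linear least-squares fit.

def frozen_prior_score(features):
    """Stand-in for the frozen MLLM's base quality estimate (never updated)."""
    return sum(features) / len(features)

def fit_residual_calibration(priors, mos_labels):
    """Fit residual r = a * prior + b by least squares on a small labelled subset.

    Only these two scalars are 'trained'; the prior scorer stays frozen,
    mirroring the paper's low trainable-parameter budget.
    """
    n = len(priors)
    residuals = [m - p for p, m in zip(priors, mos_labels)]
    mean_p = sum(priors) / n
    mean_r = sum(residuals) / n
    var_p = sum((p - mean_p) ** 2 for p in priors)
    a = (sum((p - mean_p) * (r - mean_r) for p, r in zip(priors, residuals)) / var_p
         if var_p else 0.0)
    b = mean_r - a * mean_p
    return a, b

def calibrated_score(features, a, b):
    """Final prediction = frozen base estimate + predicted residual correction."""
    prior = frozen_prior_score(features)
    return prior + (a * prior + b)

# Toy example: target MOS is a shifted and scaled version of the prior
# (mos = 2 * prior + 1), so the residual is exactly recoverable.
priors = [0.2, 0.4, 0.6, 0.8]
labels = [1.4, 1.8, 2.2, 2.6]
a, b = fit_residual_calibration(priors, labels)
```

Because the calibration branch only corrects the frozen prior rather than replacing it, adapting to a new scenario amounts to refitting this small residual head, which is why the approach can stay effective with a fraction of the MOS labels.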