Context- and Pixel-aware Large Language Model for Video Quality Assessment

arXiv cs.CV / 5/6/2026


Key Points

  • The paper introduces CP-LLM, a context- and pixel-aware multimodal LLM designed to improve video quality assessment beyond pixel-only or purely discriminative approaches.
  • CP-LLM uses two dedicated vision encoders to separately capture high-level video context and low-level pixel distortion signals, then a language decoder reasons about how these factors interact.
  • The model is intended to handle both tasks jointly—quality scoring and quality description—rather than treating them as separate, potentially disconnected outputs.
  • Experiments on VQA benchmarks show that CP-LLM achieves state-of-the-art cross-dataset performance and is more sensitive and robust to pixel-level distortions such as compression artifacts.
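The dual-pathway design above can be sketched in miniature. The following toy code is only an illustration of the idea, not the paper's implementation: the two "encoders" are stand-ins (a downsampled global pool for context, local gradient energy for pixel distortion), and the decoder, weight vector, and description vocabulary are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def context_encoder(frames):
    # High-level pathway: global pooling over spatially downsampled
    # frames stands in for a semantic vision encoder.
    ds = frames[:, ::4, ::4]                  # coarse spatial view
    return ds.mean(axis=(1, 2))               # (T, C) per-frame context features

def pixel_encoder(frames):
    # Low-level pathway: local gradient energy approximates
    # sensitivity to pixel distortions (e.g. blockiness, blur).
    dx = np.abs(np.diff(frames, axis=2)).mean(axis=(1, 2))
    dy = np.abs(np.diff(frames, axis=1)).mean(axis=(1, 2))
    return np.concatenate([dx, dy], axis=-1)   # (T, 2C) distortion features

def decode(ctx_feats, pix_feats, w_score, vocab):
    # "Language decoder" stand-in: fuse both feature streams, then
    # emit a scalar quality score and a coarse textual description,
    # mirroring the joint scoring + description objective.
    fused = np.concatenate([ctx_feats.mean(0), pix_feats.mean(0)])
    score = float(1 / (1 + np.exp(-(fused @ w_score))))  # in (0, 1)
    desc = vocab[0] if score > 0.5 else vocab[1]
    return score, desc

# Toy video: 8 frames, 32x32, 3 channels; hypothetical decoder weights.
video = rng.random((8, 32, 32, 3)).astype(np.float32)
w = rng.standard_normal(3 + 6) * 0.1
score, desc = decode(context_encoder(video), pixel_encoder(video), w,
                     ["high perceptual quality", "visible distortions"])
print(round(score, 3), desc)
```

The point of the sketch is the separation of concerns: the context features survive heavy downsampling, while the gradient-based features change sharply under compression artifacts, so the fused decoder sees both signals at once.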

Abstract

Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) are insufficiently sensitive to small distortions or treat quality scoring and quality description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g. compression artifacts). Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.