Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

arXiv cs.CV / April 21, 2026


Key Points

  • The paper introduces Q-DeepSight, a multimodal “think-with-image” framework for Image Quality Assessment (IQA) that provides actionable, localized feedback rather than only global scores.
  • Q-DeepSight uses interleaved Multimodal Chain-of-Thought with tool-augmented evidence collection (such as crop-and-zoom) to identify where quality drops and the visual reasons behind it.
  • To train long multimodal reasoning trajectories with reinforcement learning, the authors propose Perceptual Curriculum Reward (PCR) to reduce reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually grounded reasoning.
  • Experiments show state-of-the-art results on benchmarks covering natural, restored, and AI-generated imagery, and the model is further applied in a training-free loop via Perceptual-in-Generation (PiG) to iteratively improve images based on its diagnoses.
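To make the "think-with-image" idea above concrete, here is a minimal sketch of a crop-and-zoom evidence tool of the kind the paper describes the model calling during its interleaved reasoning. Images are plain 2D lists of pixel values, and the function names and signatures (`crop`, `zoom`) are illustrative assumptions, not Q-DeepSight's actual tool API.

```python
def crop(image, top, left, height, width):
    """Return the sub-grid image[top:top+height, left:left+width].

    A reasoning step would call this to isolate the region it suspects
    of degradation (e.g. a blurry face or an over-sharpened edge).
    """
    return [row[left:left + width] for row in image[top:top + height]]


def zoom(patch, factor):
    """Nearest-neighbour upscale by an integer factor.

    Magnifying the cropped patch lets the model inspect fine-grained
    artifacts that are invisible at the original resolution.
    """
    out = []
    for row in patch:
        stretched = [px for px in row for _ in range(factor)]
        # Copy each stretched row `factor` times (fresh lists, not aliases).
        out.extend(list(stretched) for _ in range(factor))
    return out


# Example: inspect the right column of a tiny 2x2 "image" at 2x zoom.
img = [[1, 2],
       [3, 4]]
patch = crop(img, top=0, left=1, height=2, width=1)   # [[2], [4]]
magnified = zoom(patch, factor=2)                     # [[2, 2], [2, 2], [4, 4], [4, 4]]
```

In the actual framework this evidence (the magnified patch) would be fed back into the multimodal chain of thought before the model commits to a localized quality judgment.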

Abstract

Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight's diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
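The closing idea, assessment diagnoses driving refinement in a training-free loop, can be sketched as follows. Everything here is a toy stand-in under stated assumptions: `critic` and `restorer` are hypothetical placeholders for Q-DeepSight and an off-the-shelf enhancement model, and an image is just a dict mapping regions to defect severities. This illustrates the PiG control flow only, not the paper's implementation.

```python
def critic(image):
    """Toy critic: score = 1 - total defect severity; also report the
    worst region, mimicking a localized diagnosis."""
    severity = sum(image.values())
    worst = max(image, key=image.get) if image else None
    return 1.0 - severity, worst


def restorer(image, region):
    """Toy restorer: halve the severity of the diagnosed region."""
    fixed = dict(image)
    if region is not None:
        fixed[region] *= 0.5
    return fixed


def pig_loop(image, max_iters=5, eps=1e-3):
    """Assess-then-refine loop: keep a candidate only while the critic's
    score still improves by more than eps."""
    score, region = critic(image)
    for _ in range(max_iters):
        candidate = restorer(image, region)
        new_score, new_region = critic(candidate)
        if new_score - score < eps:  # diagnosis no longer helps; stop
            break
        image, score, region = candidate, new_score, new_region
    return image, score


img = {"sky": 0.2, "face": 0.4}      # region -> defect severity
refined, final_score = pig_loop(img)  # final_score rises from 0.4 toward 0.9
```

The key design point mirrored here is that the loop needs no extra training: the critic's localized diagnosis tells the restorer where to act, and the critic's score provides the stopping criterion.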