Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank

arXiv cs.CV / 4/14/2026


Key Points

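  • The paper argues that existing RL2R-based image quality assessment centers on a single overall score and cannot jointly handle multiple attributes such as sharpness, color fidelity, noise level, and compositional aesthetics.
  • The proposed method, MG-IQA, is a multi-granularity reasoning framework that estimates both overall quality and fine-grained quality attributes in a single inference pass.
  • It introduces attribute-aware prompting to elicit structured, attribute-specific reasoning; a multi-dimensional Thurstone reward model that handles per-attribute rewards; and cross-domain alignment for stable training across synthetic-distortion, authentic-distortion, and AI-generated images.
  • On eight IQA benchmarks, MG-IQA is reported to outperform prior state-of-the-art methods in both overall and attribute-level prediction, with SRCC for overall quality prediction improving by 2.1% on average, and to generate interpretable, human-aligned quality descriptions.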

Abstract

Recent advances in reasoning-induced image quality assessment (IQA) have demonstrated the power of reinforcement learning to rank (RL2R) for training vision-language models (VLMs) to assess perceptual quality. However, existing approaches operate at a single granularity, predicting only an overall quality score, while overlooking the multi-dimensional nature of human quality perception, which encompasses attributes such as sharpness, color fidelity, noise level, and compositional aesthetics. In this paper, we propose MG-IQA (Multi-Granularity IQA), a multi-granularity reasoning framework that extends RL2R to jointly assess overall image quality and fine-grained quality attributes within a single inference pass. Our approach introduces three key innovations: (1) an attribute-aware prompting strategy that elicits structured multi-attribute reasoning from VLMs; (2) a multi-dimensional Thurstone reward model that computes attribute-specific fidelity rewards for group relative policy optimization; and (3) a cross-domain alignment mechanism that enables stable joint training across synthetic distortion, authentic distortion, and AI-generated image datasets without perceptual scale re-alignment. Extensive experiments on eight IQA benchmarks demonstrate that MG-IQA consistently outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.
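
The abstract does not give formulas for the multi-dimensional Thurstone reward, so the following is only a minimal sketch of how an attribute-specific fidelity reward could be computed. It assumes a Thurstone Case V comparison probability built from the mean and variance of several sampled scores per image, and the fidelity measure F(p, q) = sqrt(p q) + sqrt((1 - p)(1 - q)) between predicted and reference preference probabilities. The function names, the attribute names, and the `eps` smoothing term are hypothetical and not taken from the paper.

```python
import math
from statistics import mean, pvariance


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def thurstone_prob(mu_a, var_a, mu_b, var_b, eps=1e-6):
    """Thurstone Case V probability that image A is rated above image B.

    eps is an assumed smoothing term that avoids division by zero
    when both sample variances are zero.
    """
    return normal_cdf((mu_a - mu_b) / math.sqrt(var_a + var_b + eps))


def fidelity(p, q):
    """Fidelity between two Bernoulli preference probabilities, in [0, 1]."""
    return math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))


def attribute_rewards(scores_a, scores_b, labels):
    """Compute one fidelity reward per attribute for an image pair.

    scores_a, scores_b: dict mapping attribute name to a list of scores
        sampled from the policy (one score per rollout) for images A and B.
    labels: dict mapping attribute name to the reference preference
        (1.0 if A is better, 0.0 if B is better, 0.5 for a tie).
    """
    rewards = {}
    for attr, label in labels.items():
        sa, sb = scores_a[attr], scores_b[attr]
        p = thurstone_prob(mean(sa), pvariance(sa), mean(sb), pvariance(sb))
        rewards[attr] = fidelity(p, label)
    return rewards


# Hypothetical rollouts: A is sharper than B, but the noise label disagrees
# with the sampled scores, so the noise reward should be low.
scores_a = {"sharpness": [4.0, 4.2, 3.8], "noise": [3.0, 3.1, 2.9]}
scores_b = {"sharpness": [2.0, 2.1, 1.9], "noise": [1.0, 1.2, 0.8]}
labels = {"sharpness": 1.0, "noise": 0.0}
rewards = attribute_rewards(scores_a, scores_b, labels)
```

Under these assumptions, each attribute contributes its own reward signal, so a response that ranks sharpness correctly but noise incorrectly is rewarded on the first and penalized on the second. How MG-IQA actually aggregates the per-attribute rewards for group relative policy optimization is not specified in the abstract.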