Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

arXiv cs.CL / 4/8/2026


Key Points

  • The paper addresses a key tension in vision-language reward modeling: generative reward models are more interpretable but slow, while discriminative models are efficient but opaque.
  • It introduces VL-MDR, which dynamically decomposes evaluation into fine-grained, interpretable dimensions via a visual-aware gating mechanism that adaptively weights those dimensions for each input.
  • The approach uses a newly curated dataset of 321k vision-language preference pairs annotated across 21 dimensions such as hallucination and reasoning to support the multidimensional reward framework.
  • Experiments report that VL-MDR outperforms existing open-source reward models on benchmarks including VL-RewardBench.
  • The authors show that preference pairs generated with VL-MDR can be used for DPO alignment to reduce visual hallucinations and improve reliability in VLMs.
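The last point describes turning reward-model scores into DPO training data. The sketch below (all function names and the `beta` default are illustrative assumptions, not the paper's implementation) shows one common recipe: rank candidate responses by the reward model's scalar score, take the best and worst as (chosen, rejected), and apply the standard DPO objective to the pair.

```python
import math

def build_preference_pair(candidates, reward_fn):
    """Rank candidate responses by a scalar reward and return
    (chosen, rejected) = (highest-scored, lowest-scored).
    A common construction; the paper's exact pairing rule may differ."""
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """Standard DPO objective for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))),
    where logp_* are policy log-probs and ref_logp_* reference-model log-probs."""
    margin = beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns relatively more probability to the chosen response than the reference model does (a positive margin), the loss falls below log 2; training pushes the margin up, which is what mitigates preferred failure modes such as hallucinated responses.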

Abstract

Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify the dimensions relevant to each specific input (e.g., Hallucination, Reasoning) and adaptively weight them. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.
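The gated multidimensional aggregation the abstract describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the gating head and per-dimension scoring heads are abstracted into precomputed `gate_logits` and `dim_scores` (in the paper these would come from fused image-text features), and top-k selection is one plausible way to make the decomposition sparse and interpretable.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def multidim_reward(dim_scores, gate_logits, top_k=3):
    """Hypothetical sketch of visual-aware gated aggregation:
    dim_scores[i] is the scalar score for dimension i (e.g. hallucination,
    reasoning); gate_logits are produced by a gating head over the input
    (not modeled here). The gate softmax gives adaptive per-input weights;
    keeping only the top-k dimensions and renormalizing yields a sparse,
    interpretable decomposition instead of a monolithic scalar."""
    weights = softmax(gate_logits)
    keep = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    z = sum(weights[i] for i in keep)
    norm = {i: weights[i] / z for i in keep}      # renormalized over selected dims
    reward = sum(w * dim_scores[i] for i, w in norm.items())
    return reward, norm
```

The returned `norm` dictionary is what makes the model inspectable: it says which dimensions drove the reward for this particular input and by how much, while the weighted sum still yields the single scalar that preference ranking requires.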