DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
arXiv cs.AI / 4/22/2026
Key Points
- The paper introduces DT2IT-MRM, a multimodal reward modeling approach aimed at improving how multimodal LLMs are aligned with human preferences.
- It targets three weaknesses in existing preference data (insufficient granularity of preference strength, textual style bias, and unreliable preference signals) by combining a debiased preference-construction pipeline with a new text-to-image (T2I) preference reformulation.
- The method also includes an iterative training framework that scalably curates and denoises existing open-source multimodal preference datasets.
- Experiments report new state-of-the-art overall performance across three benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
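The paper itself is not reproduced here, but the general recipe the key points describe (train a pairwise reward model, then iteratively filter out likely-mislabeled preference pairs and retrain) can be sketched on toy data. Everything below is an illustrative assumption, not the authors' implementation: a linear reward stands in for a multimodal LLM reward head, and low-margin filtering stands in for the paper's curation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
w_true = rng.normal(size=DIM)  # hypothetical "true" reward direction

def make_pairs(n, noise_rate=0.2):
    """Synthetic preference pairs; a fraction carry flipped (noisy) labels."""
    a = rng.normal(size=(n, DIM))
    b = rng.normal(size=(n, DIM))
    pref_a = (a @ w_true > b @ w_true)          # True: a genuinely preferred
    pref_a ^= rng.random(n) < noise_rate        # inject label noise
    chosen = np.where(pref_a[:, None], a, b)    # canonicalize: chosen first
    rejected = np.where(pref_a[:, None], b, a)
    return chosen, rejected

def train_bt(chosen, rejected, steps=500, lr=0.1):
    """Fit a linear reward with the standard Bradley-Terry pairwise loss."""
    w = np.zeros(DIM)
    for _ in range(steps):
        margin = (chosen - rejected) @ w
        p = 1.0 / (1.0 + np.exp(-margin))       # P(chosen beats rejected)
        # gradient of -log sigmoid(margin) averaged over pairs
        grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
        w -= lr * grad
    return w

def iterative_curation(chosen, rejected, rounds=3, drop_quantile=0.2):
    """Each round: retrain, then drop the lowest-margin pairs, i.e. those
    the current model judges most likely to be mislabeled."""
    for _ in range(rounds):
        w = train_bt(chosen, rejected)
        margin = (chosen - rejected) @ w
        keep = margin > np.quantile(margin, drop_quantile)
        chosen, rejected = chosen[keep], rejected[keep]
    return w

chosen, rejected = make_pairs(2000)
w0 = train_bt(chosen, rejected)               # baseline on noisy data
w1 = iterative_curation(chosen, rejected)     # after iterative filtering
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"alignment with true reward: baseline {cos(w0, w_true):.3f}, "
      f"after iterative filtering {cos(w1, w_true):.3f}")
```

On this toy setup the filtered model typically recovers the underlying reward direction at least as well as the baseline, since flipped pairs tend to receive negative margins once a reasonable model is fit; the paper's actual pipeline operates on multimodal LLM preference data rather than synthetic vectors.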