Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

arXiv cs.CL / 4/8/2026


Key Points

  • The paper argues that MM-RAG re-rankers can be misled by visual distractors because they often score retrieved candidates using a full-image global embedding for image-question queries.
  • It introduces Region-R1, a query-side region-cropping framework that learns a policy to decide whether to use the whole image or crop to a question-relevant region before re-ranking.
  • Region-R1 formulates region selection as a decision-making problem and trains the policy with region-aware group relative policy optimization (r-GRPO).
  • Experiments on E-VQA and InfoSeek show consistent improvements, with conditional Recall@1 rising by up to 20% and state-of-the-art performance reported for the evaluated setups.
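
The crop-or-keep decision in the key points can be illustrated with a small sketch. This is not the paper's implementation: the `policy` and `embed` callables, the `(left, top, right, bottom)` box format, the toy `mean_embed` encoder, and the use of cosine similarity as the re-ranking score are all assumptions made for illustration. The toy image is one row of three "pixels", each a 2-d feature, where the first pixel is the object of interest and the other two are distractors.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical types for this sketch only.
Box = Tuple[int, int, int, int]          # (left, top, right, bottom)
Pixel = Tuple[float, ...]                # a tiny per-pixel feature vector
Image = List[List[Pixel]]                # rows of pixels


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def crop(image: Image, box: Box) -> Image:
    """Keep only the pixels inside the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def mean_embed(view: Image, question: str) -> List[float]:
    """Toy stand-in for a real encoder: average the pixel features.
    The question is ignored here; a real query encoder would use it."""
    pixels = [p for row in view for p in row]
    dims = len(pixels[0])
    return [sum(p[d] for p in pixels) / len(pixels) for d in range(dims)]


def rerank(
    image: Image,
    question: str,
    candidates: Sequence[Tuple[str, Sequence[float]]],
    policy: Callable[[Image, str], Optional[Box]],
    embed: Callable[[Image, str], Sequence[float]],
) -> List[str]:
    """Re-rank candidate ids after an optional query-side crop.
    The policy returns None to keep the full image, or a box to crop to."""
    box = policy(image, question)
    view = image if box is None else crop(image, box)
    query_vec = embed(view, question)
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [cid for _, cid in scored]


# One object pixel followed by two distractor pixels.
image = [[(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]]
candidates = [("A", (1.0, 0.0)), ("B", (0.0, 1.0))]

full = rerank(image, "what is this?", candidates, lambda i, q: None, mean_embed)
cropped = rerank(image, "what is this?", candidates, lambda i, q: (0, 0, 1, 1), mean_embed)
```

With the full image, the distractor pixels dominate the averaged embedding and candidate "B" ranks first; cropping to the first pixel puts candidate "A" first. In the paper's setting the fixed lambdas would be replaced by a learned policy.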

Abstract

Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn whether to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performance by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.
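
For readers unfamiliar with group relative policy optimization, the sketch below shows the group-relative advantage step that gives the method family its name: several actions are sampled for the same query, each receives a reward, and each reward is normalized against the group's mean and standard deviation. This is the generic GRPO normalization, not the paper's region-aware variant, whose details the abstract does not give. The function name, the use of the population standard deviation, the `eps` term, and the example reward (1.0 if the correct candidate ranks first after the sampled crop-or-keep action, else 0.0) are assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    `rewards` must be non-empty and hold one reward per sampled action
    for a single query. `eps` guards against division by zero when all
    rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled actions for one image-question query:
# two led to the correct candidate being ranked first, two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every action earned the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Actions that beat the group average receive positive advantages and are reinforced, while those below average are discouraged; when every sampled action earns the same reward, all advantages are zero.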