ReCQR: Incorporating conversational query rewriting to improve Multimodal Image Retrieval

arXiv cs.AI / 3/31/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes ReCQR by introducing conversational query rewriting (CQR) as a new task for multimodal image retrieval, targeting issues with long or unclear user text queries.
  • It constructs a multi-turn dialogue rewriting dataset by using LLMs to generate candidate rewrites at scale, then curating about 7,000 high-quality dialogues through an LLM-as-judge mechanism combined with manual review.
  • CQR rewrites a user’s final query into a concise, semantically complete form using full dialogue history, aiming to make queries more retrieval-friendly.
  • The authors benchmark state-of-the-art multimodal retrieval models on the ReCQR dataset and find that CQR significantly improves retrieval accuracy.
  • The work suggests broader modeling directions for how multimodal systems should interpret and transform conversational user intent before retrieval.
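The rewriting step in the key points above can be sketched as a prompt-building function over the dialogue history. This is a hypothetical illustration, not the paper's implementation: the prompt wording and the `call_llm` hook are assumptions.

```python
# Hypothetical sketch of the CQR step: given the full dialogue history,
# rewrite the user's final query into a concise, self-contained query
# that is better suited for image retrieval.

def build_cqr_prompt(history: list[str], final_query: str) -> str:
    """Assemble a rewriting prompt from the multi-turn dialogue history."""
    turns = "\n".join(f"Turn {i + 1}: {t}" for i, t in enumerate(history))
    return (
        "Rewrite the final user query so it is concise, semantically "
        "complete, and usable as a standalone image-retrieval query.\n\n"
        f"Dialogue history:\n{turns}\n\n"
        f"Final query: {final_query}\n"
        "Rewritten query:"
    )

def rewrite_query(history, final_query, call_llm):
    # `call_llm` is any text-in/text-out LLM client supplied by the caller
    # (an assumption here; the paper does not specify an interface).
    return call_llm(build_cqr_prompt(history, final_query)).strip()
```

The key design point is that the rewriter sees the *entire* history, so references that are only resolvable in context ("make it a golden retriever", "outdoors please") can be folded into one standalone query before the retriever ever runs.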

Abstract

With the rise of multimodal learning, image retrieval plays a crucial role in connecting visual information with natural language queries. Existing image retrievers struggle to process long texts and to handle unclear user expressions. To address these issues, we introduce the conversational query rewriting (CQR) task into the image retrieval domain and construct a dedicated multi-turn dialogue query rewriting dataset. Built on full dialogue histories, CQR rewrites users' final queries into concise, semantically complete ones that are better suited for retrieval. Specifically, we first leverage Large Language Models (LLMs) to generate rewritten candidates at scale and employ an LLM-as-Judge mechanism combined with manual review to curate approximately 7,000 high-quality multimodal dialogues, forming the ReCQR dataset. We then benchmark several SOTA multimodal models on ReCQR to assess their image retrieval performance. Experimental results demonstrate that CQR not only significantly enhances the accuracy of traditional image retrieval models, but also provides new directions and insights for modeling user queries in multimodal systems.
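The dataset-curation pipeline in the abstract (generate candidates at scale, score with an LLM judge, keep high scorers for manual review) can be sketched as a filter. The threshold, score scale, and both callback interfaces are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the curation pipeline: LLM-generated candidate
# rewrites are scored by an LLM judge, and only high-scoring pairs are
# kept (in the paper, these then go to manual review).

def curate(dialogues, generate_rewrites, judge_score, threshold=4):
    """Return (kept, rejected) lists of (dialogue, rewrite) pairs.

    generate_rewrites(dialogue) -> list of candidate rewrite strings
    judge_score(dialogue, rewrite) -> quality score (assumed 1-5 scale)
    """
    kept, rejected = [], []
    for dialogue in dialogues:
        for rewrite in generate_rewrites(dialogue):
            bucket = kept if judge_score(dialogue, rewrite) >= threshold else rejected
            bucket.append((dialogue, rewrite))
    return kept, rejected
```

Separating generation from judging this way lets the two roles use different models or prompts, which is the usual motivation for the LLM-as-judge pattern.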