SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

arXiv cs.CV / April 16, 2026


Key Points

  • The paper proposes SLQ, a framework for adapting frozen multimodal large language models (MLLMs) into retrievers without changing or fine-tuning the backbone parameters.
  • SLQ appends a small set of Shared Latent Queries to both text and image token sequences so the model’s causal attention can act as a global aggregation interface to produce compact embeddings in a unified space.
  • The authors argue that retrieval adaptation should elicit existing pre-trained representations rather than overwriting them, to avoid disrupting semantic space and structured knowledge needed for reasoning.
  • They introduce KARR-Bench, a benchmark aimed at knowledge-aware reasoning retrieval to better evaluate performance beyond shallow pattern matching.
  • Experiments report SLQ outperforming full fine-tuning and LoRA on COCO and Flickr30K, while also performing competitively on MMEB and delivering substantial gains on KARR-Bench.
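The aggregation mechanism in the second bullet can be sketched in a few lines: learnable query vectors are appended to the end of a token sequence, and under causal attention those final positions can attend to every preceding token, so their hidden states serve as a pooled embedding. The sketch below is a minimal, single-head toy version; the function names, dimensions, and the mean-pooling of query outputs are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def causal_attention(x):
    # Single-head scaled dot-product self-attention with a causal mask
    # (each position attends only to itself and earlier positions).
    d = x.shape[-1]
    scores = (x @ x.T) / np.sqrt(d)
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def embed_with_latent_queries(tokens, queries):
    # Append the shared latent queries to the END of the sequence; under
    # causal attention they see all tokens and act as global aggregation
    # interfaces (simplified SLQ idea; the backbone weights never change).
    seq = np.concatenate([tokens, queries], axis=0)
    out = causal_attention(seq)
    m = queries.shape[0]
    return out[-m:].mean(axis=0)  # compact fixed-size embedding

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 32))        # shared latent queries (trainable in practice)
text_tokens = rng.normal(size=(10, 32))   # stand-in for text hidden states
image_tokens = rng.normal(size=(25, 32))  # stand-in for image hidden states

e_text = embed_with_latent_queries(text_tokens, queries)
e_image = embed_with_latent_queries(image_tokens, queries)
print(e_text.shape, e_image.shape)  # both (32,): a unified embedding space
```

Because the same queries are appended to both modalities, text and image sequences of different lengths map to embeddings of the same dimensionality, which is what makes direct cross-modal comparison possible.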

Abstract

Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model's native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.
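The abstract does not spell out the training objective, but retrievers of this kind are typically trained with a symmetric in-batch contrastive (InfoNCE) loss over the paired embeddings, with gradients flowing only into the small set of latent queries while the MLLM backbone stays frozen. The sketch below shows that standard loss under those assumptions; the function names and the temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(text_emb, img_emb, temperature=0.07):
    # Standard symmetric InfoNCE: row i of text_emb is paired with row i
    # of img_emb (positive); all other rows serve as in-batch negatives.
    # In an SLQ-style setup, only the latent queries would receive
    # gradients from this loss; the backbone parameters stay frozen.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature
    t2i = -np.diag(log_softmax(logits, axis=1)).mean()  # text -> image
    i2t = -np.diag(log_softmax(logits, axis=0)).mean()  # image -> text
    return (t2i + i2t) / 2

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 32))
img_emb = text_emb + 0.1 * rng.normal(size=(8, 32))  # well-aligned pairs
loss = symmetric_info_nce(text_emb, img_emb)
print(float(loss))  # small positive value for well-aligned pairs
```

Keeping the loss on top of frozen features is what operationalizes the paper's argument: the objective shapes only how existing representations are aggregated, not the representations themselves.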