Bottleneck Tokens for Unified Multimodal Retrieval

arXiv cs.LG / 4/14/2026


Key Points

  • The paper identifies two structural gaps in applying decoder-only multimodal LLMs to unified multimodal retrieval: the limits of implicit pooling, and the absence of token-level guidance from contrastive learning on how information should be compressed.
  • As a remedy, it introduces a small set of learnable Bottleneck Tokens (BToks) that act as an explicit, fixed-capacity pooling mechanism, compressing and aggregating sequence-level information.
  • Training uses "Generative Information Condensation": next-token prediction combined with a Condensation Mask that severs the direct attention path from target tokens to query tokens, forcing all predictive signals through the BToks and thereby providing token-level supervision for compression.
  • At inference, a single forward pass over the input plus BToks suffices, with overhead reported as small relative to conventional last-token pooling; on MMEB-V2 the method is reported as state-of-the-art among 2B-scale models (Overall 59.0, +12.6 on Video-QA).
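The Condensation Mask described above can be pictured as a modified causal attention mask over a sequence laid out as [query | BToks | target]. The sketch below is a hypothetical illustration, not the paper's implementation; the layout and function name are assumptions.

```python
import numpy as np

def condensation_mask(n_query: int, n_btok: int, n_target: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence laid out as
    [query tokens | bottleneck tokens | target tokens].

    Hypothetical sketch: start from a standard causal mask, then forbid
    target tokens from attending directly to query tokens, so any
    predictive signal must flow query -> BToks -> target.
    """
    n = n_query + n_btok + n_target
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    t0 = n_query + n_btok                        # first target position
    mask[t0:, :n_query] = False                  # sever target -> query path
    return mask

m = condensation_mask(n_query=3, n_btok=2, n_target=2)
# Target rows (5, 6) cannot see query columns (0-2) but can see BToks (3, 4),
# so the next-token loss on targets supervises what the BToks encode.
```

Because the generative loss on the target tokens can only be reduced through the bottleneck positions, it acts as dense supervision for what the BToks compress.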

Abstract

Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).
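The explicit-pooling side of the design can be sketched as appending learnable bottleneck embeddings to the input and pooling only their final hidden states into the retrieval embedding. This is a minimal toy illustration, assuming a numpy stand-in for the encoder; the parameter names and mean-pooling choice are assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_btok = 16, 4

# Learnable bottleneck-token embeddings (hypothetical parameter; in the
# paper these would be trained jointly with the MLLM backbone).
btok_embed = rng.normal(size=(n_btok, d)).astype(np.float32)

def embed_sequence(token_states: np.ndarray, encoder) -> np.ndarray:
    """Append BToks to the input, run one forward pass, and pool only the
    BTok hidden states into a single L2-normalized embedding -- explicit
    fixed-capacity pooling instead of reusing a vocabulary token's state."""
    x = np.concatenate([token_states, btok_embed], axis=0)
    h = encoder(x)                 # (seq_len + n_btok, d) hidden states
    z = h[-n_btok:].mean(axis=0)   # aggregate the bottleneck positions
    return z / np.linalg.norm(z)   # unit-norm embedding for retrieval

# Toy stand-in encoder: a fixed linear map (a real system would run the MLLM).
W = rng.normal(size=(d, d)).astype(np.float32)
tokens = rng.normal(size=(10, d)).astype(np.float32)
emb = embed_sequence(tokens, lambda x: x @ W)
```

Since only the input and the BToks are processed, this matches the single-forward-pass inference the abstract describes, with cost comparable to last-token pooling plus `n_btok` extra positions.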