Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval
arXiv cs.CV / 4/29/2026
Key Points
- The paper argues that current unified multimodal retrieval methods for large multimodal models often optimize only sample-level objectives, neglecting subject-level semantics needed for coherent cross-modal grouping.
- It identifies two failure modes—semantic alignment deviation (mislocalizing text-referred regions in images) and visual modality neglect (over-reliance on text cues)—as key drivers of poor cross-modal retrieval.
- The authors propose Salient Subject-Aware Multimodal Embedding (SSA-ME), which combines LMMs with visual expert models to detect salient visual concepts in image–text pairs and aligns cross-modal attention with these semantically meaningful regions.
- A feature regeneration module recalibrates visual features using saliency maps so that the two modalities are integrated in a balanced, semantically coherent way (see the sketch after this list).
- Experiments on the MMEB benchmark show state-of-the-art retrieval performance, with qualitative analyses supporting improved interpretability and effectiveness.
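
To make the recalibration idea concrete, here is a minimal, hypothetical PyTorch sketch of saliency-weighted feature regeneration feeding a contrastive retrieval objective. The class and function names (`SaliencyFeatureRegeneration`, `contrastive_retrieval_loss`), the gating and pooling choices, and the loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyFeatureRegeneration(nn.Module):
    """Recalibrate visual patch features with per-patch saliency scores,
    then pool them into a single embedding for cross-modal retrieval.
    (Hypothetical sketch; not the paper's module.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)   # channel-wise recalibration gate
        self.proj = nn.Linear(dim, dim)   # projection into the joint embedding space

    def forward(self, patch_feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) visual patch features from the LMM's vision tower
        # saliency:    (B, N)    per-patch saliency scores from a visual expert
        weights = saliency.softmax(dim=-1).unsqueeze(-1)          # (B, N, 1)
        recalibrated = patch_feats * torch.sigmoid(self.gate(patch_feats))
        pooled = (weights * recalibrated).sum(dim=1)              # saliency-weighted pooling
        return F.normalize(self.proj(pooled), dim=-1)             # unit-norm embedding


def contrastive_retrieval_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched image-text pairs in a batch."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In use, the saliency-pooled image embedding would be contrasted against a text embedding of the same dimension, so text-referred regions with higher saliency contribute more to the retrieval score than background patches.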