Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

arXiv cs.LG / 4/8/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper shows that multimodal contrastive methods like Symile can be fragile in settings beyond image-text pairs because multiplicative interaction terms can silently degrade performance when one modality is unreliable, misaligned, or missing.
  • It argues that Symile’s symmetric treatment of modalities masks failures—performance gains over pairwise baselines may persist even though an unreliable modality corrupts the product terms.
  • The authors introduce “Gated Symile,” an attention-based, per-candidate gating mechanism that suppresses unreliable modalities by interpolating toward learnable neutral directions and adding an explicit NULL option.
  • Experiments on a synthetic benchmark designed to reveal this failure mode and on three real-world trimodal datasets show that Gated Symile improves top-1 retrieval accuracy over tuned Symile and CLIP.
  • The work frames gating as a practical direction toward more robust multimodal contrastive learning under imperfect inputs and scenarios with more than two modalities.
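As a rough illustration of the gating idea described above (a minimal sketch, not the authors' implementation: the sigmoid gate form, the `neutral` vector, and the `scale` parameter here are hypothetical stand-ins), a per-candidate gate can interpolate an embedding toward a neutral direction when its reliability score is low:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gate_modality(z, query, neutral, scale=10.0):
    """Toy per-candidate gate (hypothetical form, not the paper's exact one).

    A scalar reliability score comes from an attention-style dot product
    between the candidate embedding z and a query summarizing the other
    modalities; the gate then interpolates z toward a 'neutral' direction
    (learnable in the paper), so an unreliable modality contributes a
    near-constant vector instead of corrupting the multiplicative score.
    """
    g = sigmoid(scale * float(np.dot(query, z)))  # g -> 1 keeps z, g -> 0 suppresses it
    return g * z + (1.0 - g) * neutral

# Reliable candidate: the query agrees with z, so the gate passes z through.
z = np.array([1.0, 0.0, 0.0])
neutral = np.full(3, 0.1)          # stand-in for a learnable neutral direction
kept = gate_modality(z, query=z, neutral=neutral)

# Unreliable candidate: the query disagrees, so the output collapses to neutral.
dropped = gate_modality(z, query=-z, neutral=neutral)
```

The paper's explicit NULL option (a candidate representing "no reliable alignment exists") would sit alongside this per-candidate gating; it is omitted here for brevity.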

Abstract

Multimodal contrastive learning is increasingly enriched by going beyond image-text pairs. Among recent contrastive methods, Symile is a strong approach for this challenge because its multiplicative interaction objective captures higher-order cross-modal dependence. Yet we find that Symile treats all modalities symmetrically and does not explicitly model reliability differences, a limitation that is especially pronounced in trimodal multiplicative interactions. In practice, modalities beyond image-text pairs can be misaligned, weakly informative, or missing, and treating them uniformly can silently degrade performance. This fragility can be hidden in the multiplicative interaction: Symile may outperform pairwise CLIP even while a single unreliable modality silently corrupts the product terms. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and incorporating an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that uncovers this fragility and three real-world trimodal datasets on which such failures could be masked by average metrics, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning with imperfect inputs and more than two modalities.
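To make the hidden fragility concrete, here is a minimal sketch (not the paper's code) of the multilinear inner product that Symile-style trimodal objectives build on. Because every embedding multiplies into every term of the sum, one unreliable modality can suppress the score even when the other two modalities are perfectly aligned:

```python
import numpy as np

def trilinear_score(x, y, z):
    # Multilinear inner product sum_d x_d * y_d * z_d: the trimodal analogue
    # of the pairwise dot product used in CLIP-style contrastive scoring.
    return float(np.sum(x * y * z))

# Three well-aligned unit embeddings give a large positive score.
x = y = z = np.array([0.5, 0.5, 0.5, 0.5])
aligned = trilinear_score(x, y, z)        # 4 * 0.5**3 = 0.5

# An uninformative third modality with cancelling signs drags the whole
# product toward zero, even though x and y remain perfectly aligned.
z_bad = np.array([0.5, -0.5, 0.5, -0.5])
corrupted = trilinear_score(x, y, z_bad)  # terms cancel to 0.0
```

This is the sense in which the multiplicative interaction can "hide" a failure: aggregate retrieval accuracy can still beat pairwise baselines while the unreliable modality quietly zeroes out the product terms it touches.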