Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

arXiv cs.CV / 4/1/2026


Key Points

  • The paper introduces Omni-NegCLIP, a fine-tuned version of CLIP designed to improve vision-language models’ understanding of two common forms of negation: presence-based and absence-based negation.
  • It modifies CLIP’s original InfoNCE contrastive loss with two objectives: (a) a presence-based objective that pushes image embeddings away from presence-based negated captions while pulling them toward the original caption embeddings, and (b) an absence-based objective that aligns image embeddings with both the original and absence-based negated captions while keeping the two text embeddings semantically distinct.
  • The authors fine-tune only the front transformer layers of CLIP’s text encoder during training, based on the observation that earlier layers learn negated-text representations more effectively than later ones.
  • Experiments report substantial gains over pretrained CLIP, including up to a 52.65% improvement for presence-based negation and up to a 12.50% improvement for absence-based negation, while preserving, and in some cases improving by up to 19.62%, general image-text retrieval performance.
  • Compared with prior approaches, Omni-NegCLIP is presented as having a more comprehensive capability across multiple negation task types.
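The two objectives above can be sketched in PyTorch. This is a minimal illustration under assumptions of my own, not the paper's exact formulation: the function names, the temperature, and the hinge-style "keep the two text embeddings apart" term are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def presence_negation_loss(img, pos_txt, neg_txt, tau=0.07):
    """Illustrative sketch: pull image embeddings toward the original caption
    and push them away from the presence-based negated caption."""
    img = F.normalize(img, dim=-1)
    pos_txt = F.normalize(pos_txt, dim=-1)
    neg_txt = F.normalize(neg_txt, dim=-1)
    pos_sim = (img * pos_txt).sum(-1) / tau  # image vs. matching caption
    neg_sim = (img * neg_txt).sum(-1) / tau  # image vs. negated caption
    logits = torch.stack([pos_sim, neg_sim], dim=-1)
    # Cross-entropy with the original caption as the positive class.
    targets = torch.zeros(img.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

def absence_negation_loss(img, pos_txt, abs_txt, tau=0.07, margin=0.2):
    """Illustrative sketch: align the image with both the original caption and
    the absence-based negated caption, while a hinge term keeps the two text
    embeddings from collapsing onto each other."""
    img = F.normalize(img, dim=-1)
    pos_txt = F.normalize(pos_txt, dim=-1)
    abs_txt = F.normalize(abs_txt, dim=-1)
    # Encourage high cosine similarity of the image to both captions.
    align = 2.0 - (img * pos_txt).sum(-1) - (img * abs_txt).sum(-1)
    # Penalize the two text embeddings for being too similar to each other.
    distinct = F.relu((pos_txt * abs_txt).sum(-1) - (1.0 - margin))
    return (align + distinct).mean()
```

In training, a combined objective would sum these two terms (plus the usual retrieval loss) over each batch; the exact weighting is a hyperparameter the summary does not specify.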

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs such as CLIP perform poorly at understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation by modifying CLIP's original InfoNCE contrastive loss: presence-based negation, which negates objects actually present in an image, and absence-based negation, which negates objects that could plausibly appear in an image but are in fact absent. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both the original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Motivated by our observation that the front transformer layers of the CLIP text encoder learn negated text more effectively than the later layers, we fine-tune only those front layers at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general image-text retrieval capability, and even improves it by up to 19.62%. Compared with prior work, Omni-NegCLIP demonstrates a more comprehensive ability across multiple types of negation tasks.
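The front-layer fine-tuning strategy amounts to freezing every parameter and then re-enabling gradients only for the first few transformer blocks of the text encoder. A minimal sketch, assuming an `open_clip`-style attribute layout (`model.text_encoder.layers` is a hypothetical path, and the number of front layers is an illustrative hyperparameter the summary does not specify):

```python
import torch.nn as nn

def freeze_all_but_front_text_layers(model: nn.Module, num_front_layers: int = 4):
    """Freeze the whole model, then unfreeze only the first `num_front_layers`
    transformer blocks of the text encoder. The attribute path below is a
    hypothetical example, not a real CLIP API."""
    for p in model.parameters():
        p.requires_grad = False
    blocks = model.text_encoder.layers  # hypothetical attribute path
    for block in blocks[:num_front_layers]:
        for p in block.parameters():
            p.requires_grad = True
    # Return the names of the parameters that remain trainable, for inspection.
    return [name for name, p in model.named_parameters() if p.requires_grad]
```

An optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` would then update only those front layers at each training step.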