Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
arXiv cs.CV / 4/1/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces Omni-NegCLIP, a fine-tuned version of CLIP designed to improve vision-language models’ understanding of two common forms of negation: presence-based and absence-based negation.
- It modifies CLIP’s original InfoNCE contrastive loss with two separate objectives: (a) push image embeddings away from presence-based negated captions while pulling them toward the original caption embeddings, and (b) align images with both the original and absence-based negated captions while keeping those two text embeddings semantically distinct (an illustrative sketch of both objectives follows this list).
- The authors fine-tune only the front transformer layers of CLIP’s text encoder, motivated by the observation that earlier layers learn representations of negated text more effectively than later layers (see the layer-freezing sketch after the list).
- Experiments report substantial gains over pretrained CLIP, including up to a 52.65% improvement for presence-based negation and up to a 12.50% improvement for absence-based negation, while preserving, and in some cases improving (by up to 19.62%), general image-text retrieval performance.
- Compared with prior approaches, Omni-NegCLIP is presented as covering a broader range of negation task types.
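
The summary does not reproduce the paper’s exact loss formulas, so the following PyTorch sketch only illustrates the general idea under stated assumptions: the function names, the treatment of presence-based negated captions as extra InfoNCE hard negatives, and the margin term that keeps the two text embeddings distinct are all illustrative choices, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def presence_negation_loss(img, cap, neg_cap, temperature=0.07):
    """Illustrative InfoNCE-style loss: pull each image toward its original
    caption and push it away from the presence-based negated caption, which
    is appended to the batch as an extra hard negative."""
    img = F.normalize(img, dim=-1)
    cap = F.normalize(cap, dim=-1)
    neg_cap = F.normalize(neg_cap, dim=-1)
    # Similarities to all original captions plus all negated captions.
    logits = torch.cat([img @ cap.t(), img @ neg_cap.t()], dim=1) / temperature
    targets = torch.arange(img.size(0), device=img.device)  # matched caption index
    return F.cross_entropy(logits, targets)

def absence_negation_loss(img, cap, abs_cap, temperature=0.07, margin=0.2):
    """Illustrative loss: align the image with both the original caption and
    the absence-based negated caption, while a margin penalty keeps the two
    text embeddings from collapsing onto each other."""
    img = F.normalize(img, dim=-1)
    cap = F.normalize(cap, dim=-1)
    abs_cap = F.normalize(abs_cap, dim=-1)
    targets = torch.arange(img.size(0), device=img.device)
    align = (F.cross_entropy(img @ cap.t() / temperature, targets)
             + F.cross_entropy(img @ abs_cap.t() / temperature, targets))
    # Penalize the two text embeddings for being too similar (assumed margin).
    text_sim = (cap * abs_cap).sum(dim=-1)
    distinct = F.relu(text_sim - (1.0 - margin)).mean()
    return align + distinct
```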
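
Likewise, here is a minimal sketch of front-layer-only fine-tuning, assuming the Hugging Face CLIPModel as a stand-in and an arbitrary choice of four front layers; the summary does not state which checkpoint or how many layers the authors actually train.

```python
from transformers import CLIPModel

NUM_FRONT_LAYERS = 4  # assumed value; the paper's layer count is not given here

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze every parameter first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the front transformer layers of the text encoder.
for layer in model.text_model.encoder.layers[:NUM_FRONT_LAYERS]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```

Because the vision tower and the later text layers stay frozen, only a small fraction of the model’s parameters receives gradients, which keeps this style of fine-tuning cheap relative to full fine-tuning.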