Disparities In Negation Understanding Across Languages In Vision-Language Models
arXiv cs.CL / 4/22/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- Vision-language models often show affirmation bias: they mistakenly select affirmative captions even when the correct answer requires negation (a minimal probe of this setup is sketched after this list).
- The study shows that negation behavior varies across languages with factors such as morphology, word order, and cliticization, which may limit how well existing fixes generalize.
- Researchers introduce the first human-verified multilingual negation benchmark covering seven diverse languages (English, Mandarin, Arabic, Greek, Russian, Tagalog, Spanish).
- Evaluation of CLIP, SigLIP, and MultiCLIP finds that standard CLIP is at or below chance on non-Latin-script languages, while MultiCLIP delivers the highest and most consistent accuracy.
- A proposed negation-correction method (SpaceVLM) improves results in multiple languages, but its effectiveness varies across typologically different languages, highlighting fairness-relevant interactions between language properties and model improvements.
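To make the affirmation-bias setup concrete, here is a minimal sketch of the kind of caption-selection probe the key points describe, written against the Hugging Face transformers CLIP API. The checkpoint name is a real public model, but the image path and caption pair are illustrative placeholders, not items from the paper's benchmark.

```python
# Minimal affirmation-bias probe for a CLIP-style model (a sketch, not the
# paper's benchmark harness). "street_scene.jpg" and the captions below are
# hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical image of a street with no dog in it
captions = [
    "a street with a dog",   # affirmative distractor
    "a street with no dog",  # correct, negated caption
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

# Affirmation bias shows up as the affirmative caption getting the higher
# score even though the negated caption matches the image.
for caption, p in zip(captions, probs[0]):
    print(f"{p.item():.3f}  {caption}")
```

Running the same probe with translated caption pairs captures the spirit of the multilingual evaluation: a model with affirmation bias ranks the affirmative caption higher regardless of what the image actually shows.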