Ukrainian Visual Word Sense Disambiguation Benchmark

arXiv cs.CV / 3/26/2026


Key Points

  • The paper introduces a new Ukrainian benchmark for the Visual Word Sense Disambiguation (Visual-WSD) task, in which a model must pick, from a set of ten candidate images and with minimal context, the image that best depicts the intended sense of an ambiguous word.
  • It adapts an established cross-language benchmark methodology previously used for English, Italian, and Farsi, enabling comparisons across languages.
  • The dataset was collected semi-automatically and refined with domain-expert input to improve labeling quality.
  • Experiments evaluate eight multilingual/multimodal large language models on the benchmark, finding that all tested models underperform a zero-shot CLIP-based baseline used in the English Visual-WSD benchmark.
  • The analysis identifies a large performance gap between Ukrainian and English on the Visual-WSD task, suggesting language-specific challenges for current multimodal models.

Abstract

This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.
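
To make the task setup concrete, the selection step can be sketched as a similarity ranking: embed the ambiguous word with its minimal context, embed each of the ten candidate images, and pick the image whose embedding is closest to the text. The sketch below is not the authors' code and does not call CLIP. It assumes the embeddings have already been computed by some text-image encoder, and it uses small hand-made vectors in their place. The function names (`cosine`, `rank_candidates`, `hit_at_1`) and the hit@1 metric are illustrative assumptions, since the summary does not say which metric the benchmark reports.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(text_vec, image_vecs):
    """Return candidate indices ordered from most to least similar to the text."""
    scores = [cosine(text_vec, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


def hit_at_1(rankings, gold):
    """Fraction of instances whose top-ranked image is the gold image."""
    return sum(r[0] == g for r, g in zip(rankings, gold)) / len(gold)


# Toy stand-ins for encoder outputs (assumption: real embeddings would come
# from a multilingual text-image model and have hundreds of dimensions).
text_vec = [1.0, 0.2, 0.0]      # ambiguous word plus its minimal context
image_vecs = [                  # ten candidate images; index 3 is the gold one
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.1, 0.9, 0.3],
    [0.9, 0.25, 0.0],
    [0.2, 0.1, 0.9],
    [-1.0, 0.0, 0.0],
    [0.3, 0.8, 0.1],
    [0.0, 0.5, 0.5],
    [0.4, 0.4, 0.8],
    [0.5, 0.9, 0.0],
]

ranking = rank_candidates(text_vec, image_vecs)
score = hit_at_1([ranking], gold=[3])
```

Because no training on the benchmark is involved, this kind of ranking is zero-shot in the same sense as the baseline mentioned in the abstract. The quality of the result depends entirely on how well the encoder represents Ukrainian text, which is one plausible place for the reported gap with English to arise.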