When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities

arXiv cs.CL / April 14, 2026


Key Points

  • The paper argues that current language models often struggle with idiomatic and culturally grounded meaning because they over-rely on surface-level lexical/semantic cues rather than metaphorical intent.
  • It presents “Mediom,” a multilingual, multimodal corpus covering 3,533 Hindi, Bengali, and Thai idioms, with gold-standard explanations, cross-lingual translations, and aligned text–image representations, enabling evaluation of figurative disambiguation.
  • The authors benchmark both large language models (textual reasoning) and vision-language models on Mediom, finding systematic failures in metaphor and idiom comprehension.
  • To address these gaps, they propose “HIDE,” a hinting-based idiom explanation framework that uses error-feedback retrieval and targeted diagnostic cues to iteratively refine model reasoning.
  • Overall, Mediom and HIDE are positioned as a rigorous test bed and methodology for building next-generation AI systems capable of culturally grounded, multimodal idiom understanding.

Abstract

Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. Consider the Bengali idiom *আঙ্গুর ফল টক* (angur fol tok, "grapes are sour"): it encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. Addressing this oversight, we present "Mediom," a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text–image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose "HIDE," a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Together, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding in next-generation AI systems.