Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

arXiv cs.CV / 4/3/2026


Key Points

  • The paper argues that current large vision-language models often fail when images act as clues and the answer depends on multi-step cognitive reasoning beyond explicit visual recognition.
  • It introduces RebusBench, a benchmark containing 1,164 rebus puzzles designed to test neurosymbolic capability by requiring perception-to-language attribute extraction, idiom/linguistic prior retrieval, and abstract mapping to generate meaning outside pixel space.
  • Evaluations on models such as Qwen, InternVL, and LLaVA show severe limitations, with results saturating below 10% Exact Match and 20% semantic accuracy.
  • The authors report no significant gains from model scaling or in-context learning, suggesting the models lack the "reasoning glue" to connect their visual and linguistic components rather than lacking the components themselves.
  • The work positions rebus-style tasks as a diagnostic for integration of visual understanding with external knowledge and systematic reasoning.
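The two reported metrics can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code: the normalization rules are assumptions, and the character-overlap similarity is a stand-in for whatever semantic scorer (e.g. an embedding model or LLM judge) the benchmark actually uses.

```python
import difflib

def exact_match(pred: str, gold: str) -> bool:
    # Case- and whitespace-insensitive string equality (assumed normalization).
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)

def semantic_accuracy(pred: str, gold: str, threshold: float = 0.8) -> bool:
    # Stand-in similarity; the paper's semantic metric is likely stronger
    # than raw character overlap.
    ratio = difflib.SequenceMatcher(None, pred.lower(), gold.lower()).ratio()
    return ratio >= threshold

def benchmark_score(preds, golds):
    # Aggregate both metrics over a set of puzzle answers.
    n = len(golds)
    em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / n
    sem = sum(semantic_accuracy(p, g) for p, g in zip(preds, golds)) / n
    return em, sem
```

Under this reading, a model scoring below 10% Exact Match gets fewer than one in ten answers string-identical to the gold answer, and below 20% semantic accuracy means even loosely similar answers remain rare.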

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
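The three-step cognitive workflow the abstract describes (attribute extraction, linguistic-prior retrieval, abstract mapping) can be caricatured as a toy pipeline. Everything here is illustrative: the function names, the phonetic table, and the lexicon are invented for this sketch and do not come from the paper, and real step 1 would be an LVLM naming depicted objects rather than a stub.

```python
# Step 1: perception-to-language attribute extraction (stubbed: a real
# system would have the LVLM name the objects depicted in the image).
def extract_attributes(image_tokens):
    return [t.lower() for t in image_tokens]

# Step 2: retrieve linguistic priors -- phonetic readings of each clue
# (a tiny hypothetical table standing in for learned knowledge).
PHONETIC = {"bee": "be", "leaf": "lief", "sun": "son", "eye": "i"}

# Candidate answers that exist "outside the pixel space".
LEXICON = {"belief", "grandson", "icon"}

# Step 3: abstract mapping -- combine the readings and check whether the
# synthesis lands on a known word or idiom.
def solve_rebus(image_tokens):
    attrs = extract_attributes(image_tokens)
    reading = "".join(PHONETIC.get(a, a) for a in attrs)
    return reading if reading in LEXICON else None

print(solve_rebus(["bee", "leaf"]))  # -> belief
```

The point of the caricature is that no single step is hard in isolation; the failure mode the paper reports is in chaining them, which is exactly what this benchmark isolates.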