Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

arXiv cs.CL / 4/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study investigates why vision-language models struggle on abstract visual reasoning tasks (e.g., Bongard problems) by separating the roles of “reasoning” versus “representation.”
  • Using Bongard-LOGO, the authors compare end-to-end VLMs that take raw images against LLMs that receive symbolic inputs extracted from those images.
  • They introduce a Componential–Grammatical (C–G) paradigm that reformulates the benchmark as symbolic reasoning over LOGO-style action programs or structured descriptions.
  • LLMs show large and consistent improvements, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline stays near chance when task definitions are matched.
  • Ablation results indicate that factors like input format, explicit concept prompts, and limited visual grounding are less influential than replacing pixel inputs with symbolic structure, pointing to representation as the key bottleneck.

Abstract

Vision–language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential–Grammatical (C–G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
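To make the "pixels versus symbolic structure" contrast concrete, here is a minimal Python sketch of what feeding an LLM symbolic inputs could look like: LOGO-style action programs serialized into a Bongard-style few-shot text prompt. The tuple schema (`name`, `length`, `angle`), the helper names, and the prompt wording are all illustrative assumptions, not the paper's actual data format or prompts.

```python
# Hypothetical sketch of the symbolic-input idea: instead of raw images,
# an LLM receives textual renderings of LOGO-style action programs.
# The (name, length, angle) schema and prompt text are assumptions for
# illustration, not the authors' actual C-G format.

def serialize_program(actions):
    """Render a list of (action, length, angle) tuples as one text line."""
    return "; ".join(f"{name}(len={length}, ang={angle})"
                     for name, length, angle in actions)

def build_prompt(positives, negatives, query):
    """Assemble a Bongard-style classification prompt from symbolic
    programs: positive set, negative set, then a query to classify."""
    lines = ["Positive examples:"]
    lines += [f"  {serialize_program(p)}" for p in positives]
    lines.append("Negative examples:")
    lines += [f"  {serialize_program(n)}" for n in negatives]
    lines.append("Query: " + serialize_program(query))
    lines.append("Does the query match the positive concept? Answer yes or no.")
    return "\n".join(lines)

if __name__ == "__main__":
    pos = [[("line", 1.0, 90), ("arc", 0.5, 45)]]
    neg = [[("line", 1.0, 0)]]
    query = [("line", 1.0, 90), ("arc", 0.5, 45)]
    print(build_prompt(pos, neg, query))
```

The point of such a probe is that the LLM never sees pixels: if accuracy jumps under this purely textual reformulation, the failure of end-to-end VLMs is more plausibly representational than a lack of reasoning ability.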