Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

arXiv cs.CV · 20 Apr 2026


Key Points

  • The paper introduces “Mind’s Eye,” a multiple-choice benchmark for evaluating multimodal LLMs on visual cognitive and visuospatial reasoning through eight tasks categorized by an Abstraction–Relation–Transformation (A-R-T) taxonomy.
  • The benchmark targets core “fluid intelligence” abilities such as pattern induction, analogical relation mapping, and mental transformation.
  • Across evaluations of both open- and closed-source MLLMs, top models score below 50% accuracy, while human participants reach about 80% accuracy.
  • Error analysis attributes most model failures to misallocated visual attention, insufficient internal perceptual manipulation, and weak abstraction of underlying visual concepts.
  • The authors argue that current multimodal LLMs fall short of human-level visuospatial reasoning, and that more cognitively grounded evaluation frameworks are needed.

Abstract

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less well understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals three failure modes: (i) misallocated visual attention, (ii) insufficient internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
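To make the evaluation protocol concrete, below is a minimal sketch of how a multiple-choice benchmark of this shape is typically scored: iterate over items, ask the model to pick an option, and compute overall and per-category accuracy across the three A-R-T buckets. This is not the paper's actual code; the `Item` schema, the `evaluate` helper, and the `predict` callable are all hypothetical stand-ins for whatever MLLM interface a user wires up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

# Hypothetical item structure; field names are illustrative, not the paper's schema.
@dataclass
class Item:
    image_path: str    # stimulus image for the visuo-cognitive task
    choices: list[str] # multiple-choice option labels, e.g. ["A", "B", "C", "D"]
    answer: str        # gold label
    category: str      # A-R-T bucket: "Abstraction", "Relation", or "Transformation"

def evaluate(items: list[Item],
             predict: Callable[[str, list[str]], str]) -> dict[str, float]:
    """Score a model on multiple-choice items, reporting overall and
    per-category accuracy. `predict` maps (image_path, choices) to a
    chosen label, e.g. a thin wrapper around an MLLM API call."""
    correct: defaultdict[str, int] = defaultdict(int)
    total: defaultdict[str, int] = defaultdict(int)
    for item in items:
        total[item.category] += 1
        if predict(item.image_path, item.choices) == item.answer:
            correct[item.category] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores

# Example with a trivial baseline that always picks the first option.
if __name__ == "__main__":
    demo = [
        Item("q1.png", ["A", "B", "C", "D"], "B", "Abstraction"),
        Item("q2.png", ["A", "B", "C", "D"], "A", "Transformation"),
    ]
    print(evaluate(demo, lambda img, choices: choices[0]))
```

Hiding the model behind a plain callable keeps the harness agnostic to whether the MLLM is a local open-source checkpoint or a closed-source API, matching the mixed evaluation suite the paper describes.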