Limits of Imagery Reasoning in Frontier LLM Models

arXiv cs.CV / 3/31/2026


Key Points

  • The paper tests whether adding an external “Imagery Module” to a frontier LLM can improve performance on spatial reasoning tasks like mental rotation via 3D model rendering and manipulation.
  • Using a dual-module architecture (a reasoning MLLM paired with an imagery tool that renders and rotates 3D models), accuracy peaks at only 62.5%, well below expectations.
  • Even when the burden of maintaining and manipulating a holistic 3D state is offloaded to the imagery tool, the combined system still fails to achieve robust spatial reasoning.
  • The findings suggest current frontier LLMs lack core visual-spatial primitives: low-level sensitivity to depth, motion, and short-horizon dynamic prediction, and the capacity for contemplative, dynamically focused reasoning over images.

Abstract

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external “Imagery Module” (a tool capable of rendering and rotating 3D models) can bridge this gap, functioning as a “cognitive prosthetic.” We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.
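To make the dual-module loop concrete, here is a minimal toy sketch under assumptions of ours (the paper does not publish its implementation): the 3D object is a voxel set, the "imagery module" exposes rotate and render primitives, and the "reasoning module" is stubbed as a brute-force search over rotation sequences, where the real system would instead query an MLLM to choose the next rotation from the rendered view.

```python
# Hypothetical sketch of the paper's dual-module architecture.
# Everything below (voxel representation, function names, search stub)
# is our illustrative assumption, not the authors' code.
from itertools import product

def rotate_z(voxels):
    """Imagery-module primitive: rotate the voxel set 90 deg about z."""
    return frozenset((-y, x, z) for (x, y, z) in voxels)

def rotate_x(voxels):
    """Imagery-module primitive: rotate the voxel set 90 deg about x."""
    return frozenset((x, -z, y) for (x, y, z) in voxels)

def render_top(voxels):
    """Imagery-module primitive: orthographic top-down 'rendering'."""
    return frozenset((x, y) for (x, y, z) in voxels)

def reasoning_loop(obj, target_view, max_steps=4):
    """Stand-in for the reasoning MLLM: propose rotation sequences,
    ask the imagery module to render each result, and compare views.
    (Brute force here; the paper's point is that an MLLM in this role
    fails to read the rendered views reliably.)"""
    for depth in range(1, max_steps + 1):
        for seq in product([rotate_x, rotate_z], repeat=depth):
            state = obj
            for op in seq:
                state = op(state)
            if render_top(state) == target_view:
                return seq  # rotation plan whose rendering matches
    return None

# An L-shaped voxel object and a target view produced by one z-rotation.
obj = frozenset({(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)})
target = render_top(rotate_z(obj))
plan = reasoning_loop(obj, target)
```

The sketch shows why the division of labor should, in principle, help: the tool holds the full 3D state and performs exact rotations, so the reasoning module only needs to compare 2D renderings, which is precisely the visual-primitive step the paper finds frontier models cannot do robustly.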