Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

arXiv cs.CV · April 14, 2026


Key Points

  • The paper introduces Grid2Matrix (G2M), a controlled benchmark for vision-language models that tests whether they can faithfully transcribe an image-defined color grid into the correct color-to-number matrix.
  • Using G2M with increasing visual complexity, the authors observe a sharp early collapse in zero-shot end-to-end performance, where models fail even on relatively small grids rather than degrading smoothly.
  • Probing VLM visual encoders shows they retain substantially more grid information than what the full end-to-end system outputs, indicating the problem is not solely due to visual feature extraction.
  • The authors characterize a structured error pattern that depends on how grid cells align with the model’s visual patch boundaries, and they coin the gap between recoverable visual features and expressed language as “Digital Agnosia.”
  • Common mitigations such as model scaling and multimodal alignment do not fully remove the failure mode, and G2M is proposed as a testbed for tasks where missing even small visual details matters (e.g., tables, charts, forms, GUIs).

Abstract

Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap "Digital Agnosia." Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode. We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.
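To make the task concrete, the abstract's setup (a color grid, a color-to-number mapping, and the target matrix, with per-cell scoring) can be sketched as below. Function names, the palette, and the metric are illustrative assumptions, not the authors' actual code:

```python
import random

def make_g2m_instance(rows: int, cols: int, n_colors: int, seed: int = 0):
    """Build one hypothetical G2M item: a grid of cell colors, a
    color-to-number mapping, and the ground-truth matrix."""
    rng = random.Random(seed)
    palette = ["red", "green", "blue", "yellow", "purple", "orange"][:n_colors]
    # Each cell gets a random color; the target is the mapped number.
    grid = [[rng.choice(palette) for _ in range(cols)] for _ in range(rows)]
    mapping = {color: i for i, color in enumerate(palette)}
    matrix = [[mapping[c] for c in row] for row in grid]
    return grid, mapping, matrix

def cell_accuracy(pred, target) -> float:
    """Fraction of cells reconstructed correctly, a natural per-cell metric
    for a matrix-transcription task like G2M."""
    total = sum(len(row) for row in target)
    correct = sum(p == t for pr, tr in zip(pred, target)
                  for p, t in zip(pr, tr))
    return correct / total

grid, mapping, matrix = make_g2m_instance(4, 4, 3)
print(cell_accuracy(matrix, matrix))  # 1.0 for a perfect reconstruction
```

In an actual evaluation, `grid` would be rendered to an image, the image plus `mapping` would be given to the VLM as the prompt, and the model's emitted matrix would be compared against `matrix`; the sharp collapse the authors report would show up as cell accuracy dropping steeply as `rows`, `cols`, or `n_colors` grow.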