AI Navigate

Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

arXiv cs.AI / 3/20/2026


Key Points

  • The paper introduces a benchmark to evaluate multimodal large language models on discrete symbol understanding across language, culture, mathematics, physics, and chemistry.
  • It reports a "cognitive mismatch": models struggle with basic symbol recognition yet perform surprisingly well on some reasoning tasks, suggesting reliance on linguistic probabilities rather than true visual perception.
  • The findings reveal a significant gap in current AI capabilities for truly perceiving and understanding symbolic languages that underpin scientific discovery.
  • The work provides a roadmap for developing more rigorous, human-aligned intelligent systems.

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.