AI Navigate

GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

arXiv cs.AI / 3/23/2026


Key Points

  • GeoChallenge introduces a dataset of 90,000 automatically generated multiple-choice geometry proof problems that require multi-step reasoning over aligned text and diagrams.
  • It provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation of geometric reasoning in LLMs.
  • Experiments across advanced LLMs reveal a clear gap between model and human performance: the best-performing model, GPT-5-nano, achieves 75.89 exact match versus 94.74 for humans (a scoring sketch follows this list).
  • The authors identify three failure patterns: exact-match struggles under MCQ constraints, weak visual reliance, and overextended reasoning without convergence.
  • Overall, GeoChallenge aims to enable more reliable evaluation of AI’s geometric reasoning and to illuminate current model limitations.
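
The exact-match figures above refer to the multi-answer multiple-choice setting, where a response typically counts as correct only if the selected option set matches the gold set exactly; the paper's actual scoring script is not shown here. A minimal sketch of that scoring rule, with illustrative data only:

```python
from typing import Iterable, List, Set


def exact_match(pred: Iterable[str], gold: Iterable[str]) -> bool:
    """Score 1 only when the selected option set equals the gold set exactly."""
    normalize = lambda opts: {o.strip().upper() for o in opts}
    return normalize(pred) == normalize(gold)


def exact_match_score(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    """Percentage of questions whose predicted option set is exactly correct."""
    assert len(preds) == len(golds), "predictions and references must align"
    return 100.0 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)


if __name__ == "__main__":
    # Hypothetical multi-answer questions with options A-D.
    preds = [{"A", "C"}, {"B"}, {"A", "D"}]
    golds = [{"A", "C"}, {"B", "D"}, {"A", "D"}]
    print(f"{exact_match_score(preds, golds):.2f}")  # 66.67
```

Under this rule, partially correct option sets earn no credit, which is one reason the paper highlights exact-match failures as a distinct error pattern.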

Abstract

Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
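
The abstract notes that each problem carries a fine-grained complexity rating, which is what makes controlled evaluation possible: accuracy can be sliced by complexity level rather than reported as a single aggregate. A minimal sketch of such a breakdown, assuming per-question records with hypothetical complexity and correct fields (the dataset's actual schema is not described here):

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_complexity(records: List[dict]) -> Dict[int, float]:
    """Bucket per-question results by complexity rating and report accuracy per bucket."""
    buckets: Dict[int, List[bool]] = defaultdict(list)
    for r in records:
        buckets[r["complexity"]].append(bool(r["correct"]))
    return {level: 100.0 * sum(flags) / len(flags) for level, flags in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical per-question results at three complexity levels.
    results = [
        {"complexity": 1, "correct": True},
        {"complexity": 1, "correct": True},
        {"complexity": 2, "correct": True},
        {"complexity": 2, "correct": False},
        {"complexity": 3, "correct": False},
    ]
    print(accuracy_by_complexity(results))  # {1: 100.0, 2: 50.0, 3: 0.0}
```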