AI Navigate

GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

arXiv cs.AI / 3/23/2026


Key Points

  • GeoChallenge introduces a dataset of 90,000 automatically generated multiple-choice geometry proof problems that require multi-step reasoning over aligned text and diagrams.
  • It provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation of geometric reasoning in LLMs.
  • Experiments across advanced LLMs reveal a clear gap between model and human performance: the best-performing model, GPT-5-nano, achieves 75.89 exact match versus 94.74 for humans (a scoring sketch follows this list).
  • The authors identify three failure patterns: exact-match struggles under MCQ constraints, weak visual reliance, and overextended reasoning without convergence.
  • Overall, GeoChallenge aims to enable more reliable evaluation of AI’s geometric reasoning and to illuminate current model limitations.
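
The exact-match figures above refer to the multi-answer multiple-choice setting, where a response typically counts as correct only if the selected option set matches the gold set exactly; the paper's actual scoring script is not shown here. A minimal sketch of that scoring rule, with illustrative data only:

```python
from typing import Iterable, List, Set


def exact_match(pred: Iterable[str], gold: Iterable[str]) -> bool:
    """Score 1 only when the selected option set equals the gold set exactly."""
    normalize = lambda opts: {o.strip().upper() for o in opts}
    return normalize(pred) == normalize(gold)


def exact_match_score(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    """Percentage of questions whose predicted option set is exactly correct."""
    assert len(preds) == len(golds), "predictions and references must align"
    return 100.0 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)


if __name__ == "__main__":
    # Hypothetical multi-answer questions with options A-D.
    preds = [{"A", "C"}, {"B"}, {"A", "D"}]
    golds = [{"A", "C"}, {"B", "D"}, {"A", "D"}]
    print(f"{exact_match_score(preds, golds):.2f}")  # 66.67
```

Under this rule, partially correct option sets earn no credit, which is one reason the paper highlights exact-match failures as a distinct error pattern.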

Abstract

Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
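
The abstract notes that each problem carries a fine-grained complexity rating, which is what makes controlled evaluation possible: accuracy can be sliced by complexity level rather than reported as a single aggregate. A minimal sketch of such a breakdown, assuming per-question records with hypothetical complexity and correct fields (the dataset's actual schema is not described here):

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_complexity(records: List[dict]) -> Dict[int, float]:
    """Bucket per-question results by complexity rating and report accuracy per bucket."""
    buckets: Dict[int, List[bool]] = defaultdict(list)
    for r in records:
        buckets[r["complexity"]].append(bool(r["correct"]))
    return {level: 100.0 * sum(flags) / len(flags) for level, flags in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical per-question results at three complexity levels.
    results = [
        {"complexity": 1, "correct": True},
        {"complexity": 1, "correct": True},
        {"complexity": 2, "correct": True},
        {"complexity": 2, "correct": False},
        {"complexity": 3, "correct": False},
    ]
    print(accuracy_by_complexity(results))  # {1: 100.0, 2: 50.0, 3: 0.0}
```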