GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
arXiv cs.AI / 3/23/2026
Key Points
- GeoChallenge introduces a dataset of 90,000 automatically generated multi-answer multiple-choice geometry proof problems that require multi-step reasoning over aligned text and diagrams.
- It provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation of geometric reasoning in LLMs.
- Experiments across advanced LLMs reveal a gap between model performance and human capability, with GPT-5-nano achieving an exact-match score of 75.89 versus 94.74 for humans (a scoring sketch follows this list).
- The authors identify three failure patterns: difficulty producing exact option sets under multiple-choice constraints, weak reliance on the diagram, and overextended reasoning that fails to converge.
- Overall, GeoChallenge aims to enable more reliable evaluation of AI’s geometric reasoning and to illuminate current model limitations.
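
The exact-match numbers above imply set-level scoring: a multi-answer item counts as correct only if the model selects exactly the gold set of options, with no credit for partial overlap. Below is a minimal sketch of that metric, assuming predictions and gold labels are sets of option letters; the function name and data layout are illustrative, not the paper's evaluation code.

```python
# Hedged sketch of exact-match scoring for multi-answer multiple-choice items.
# Assumption: each prediction and gold label is a set of option letters.

def exact_match_accuracy(predictions, gold_labels):
    """Score an item as correct only when the predicted option set equals the gold set."""
    assert len(predictions) == len(gold_labels), "Need one prediction per item"
    hits = sum(
        1 for pred, gold in zip(predictions, gold_labels)
        if set(pred) == set(gold)
    )
    return 100.0 * hits / len(gold_labels)

# Example: partial overlap ({"A"} vs {"A", "C"}) earns no credit under exact match.
preds = [{"A", "C"}, {"B"}, {"A"}]
golds = [{"A", "C"}, {"B", "D"}, {"A", "C"}]
print(exact_match_accuracy(preds, golds))  # 33.33...
```

Under this kind of all-or-nothing scoring, a model that reliably finds one correct option but misses a second scores zero on the item, which is consistent with the "exact-match struggles under MCQ constraints" failure pattern noted above.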