Measuring Representation Robustness in Large Language Models for Geometry
arXiv cs.CL / 4/21/2026
Key Points
- The paper argues that existing LLM math/geometry benchmarks overestimate reasoning ability by evaluating only fixed problem formats, so failures caused by changing representations can go unnoticed.
- It introduces GeoRepEval, a representation-aware evaluation framework that tests correctness, invariance, and consistency across multiple parallel geometric formulations (e.g., Euclidean, coordinate, vector) using statistical and regression-based controls.
- The authors show that changing the representation alone can produce accuracy gaps of up to 14 percentage points across 11 LLMs evaluated on 158 curated high-school geometry problems.
- Vector formulations are identified as a particularly consistent failure point, with Invariance@3 dropping as low as 0.044 even after controlling for length and symbolic complexity.
- A convert-then-solve prompting strategy can substantially improve vector accuracy (up to +52 percentage points) for high-capacity models, implying representation sensitivity rather than outright inability, while low-capacity models benefit little.
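The paper's Invariance@k metric is not fully defined in this summary; a minimal sketch, assuming it measures the fraction of problems a model answers correctly under all k parallel representations of the same problem (the names `invariance_at_k`, `results`, and the representation labels are illustrative, not from the paper):

```python
def invariance_at_k(results):
    """Fraction of problems solved correctly under every representation.

    results: list of per-problem dicts mapping a representation name
    (e.g. "euclidean", "coordinate", "vector") to whether the model's
    answer under that representation was correct.
    """
    if not results:
        return 0.0
    # A problem counts as invariant only if the model is correct
    # under all k representations it was posed in.
    invariant = sum(1 for per_repr in results if all(per_repr.values()))
    return invariant / len(results)

# Toy example: 3 problems, each posed in three parallel formulations.
results = [
    {"euclidean": True, "coordinate": True, "vector": True},   # invariant
    {"euclidean": True, "coordinate": True, "vector": False},  # breaks on vector
    {"euclidean": True, "coordinate": False, "vector": False},
]
print(invariance_at_k(results))  # → 0.333… (1 of 3 problems invariant)
```

Under this reading, a vector-formulation Invariance@3 of 0.044 would mean only about 4% of problems survive all three representations, consistent with the paper's claim that vector forms are a systematic failure point.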