Measuring Representation Robustness in Large Language Models for Geometry

arXiv cs.CL / 4/21/2026


Key Points

  • The paper argues that existing LLM math/geometry benchmarks overestimate reasoning ability by evaluating only fixed problem formats, so failures caused by changing representations can go unnoticed.
  • It introduces GeoRepEval, a representation-aware evaluation framework that tests correctness, invariance, and consistency across multiple parallel geometric formulations (e.g., Euclidean, coordinate, vector) using statistical and regression-based controls.
  • The authors show that changing the representation alone can produce accuracy gaps of up to 14 percentage points across 11 LLMs evaluated on 158 curated high-school geometry problems.
  • Vector formulations are identified as a particularly consistent failure point, with Invariance@3 dropping as low as 0.044 even after controlling for length and symbolic complexity.
  • A convert-then-solve prompting strategy can substantially improve vector accuracy (up to +52 percentage points) for high-capacity models, implying representation sensitivity rather than outright inability; low-capacity models show little or no gain, pointing to deeper limitations.
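The invariance metric above can be illustrated with a minimal sketch. It assumes Invariance@3 counts a problem as robust only when all three parallel formulations are answered correctly; under that assumption the metric is automatically bounded by the accuracy of the weakest representation. The function names and toy data are illustrative, not taken from the paper's released code.

```python
# Sketch of an Invariance@3-style metric (assumed definition: a problem is
# invariant only if all three representations are answered correctly).
# Data and names are illustrative, not from the GeoRepEval release.

def per_rep_accuracy(results, rep):
    """Accuracy of a single representation over all problems."""
    return sum(r[rep] for r in results) / len(results)

def invariance_at_3(results):
    """Fraction of problems solved correctly in every representation."""
    return sum(all(r.values()) for r in results) / len(results)

# results[i] maps representation -> correctness for problem i
results = [
    {"euclidean": True,  "coordinate": True,  "vector": True},
    {"euclidean": True,  "coordinate": True,  "vector": False},
    {"euclidean": True,  "coordinate": False, "vector": False},
    {"euclidean": False, "coordinate": True,  "vector": True},
]

inv3 = invariance_at_3(results)   # 0.25: only the first problem is robust
weakest = min(per_rep_accuracy(results, rep) for rep in results[0])
assert inv3 <= weakest            # bounded by the weakest representation
```

The bound follows directly: a problem counted by Invariance@3 is correct in every representation, so it also contributes to each per-representation accuracy.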

Abstract

Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
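The paired comparisons mentioned in the abstract can be sketched with McNemar's test, which compares two representations on the same problem set using only the discordant pairs (problems that flip between correct and incorrect). This is a generic exact-binomial formulation, assumed for illustration; the paper's statistical scripts may use a different variant.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact (binomial) McNemar test on paired per-problem correctness.

    b = problems correct under A but not B; c = the reverse.
    Under H0 the flips are symmetric: b ~ Binomial(b + c, 0.5).
    Returns (b, c, two_sided_p).
    """
    b = sum(x and not y for x, y in zip(correct_a, correct_b))
    c = sum(y and not x for x, y in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # two-sided exact p-value: double the smaller binomial tail, capped at 1
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Illustrative data: the same 10 problems scored under two representations
euclidean = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
vector    = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
b, c, p = mcnemar_exact(euclidean, vector)  # 5 flips toward Euclidean, 0 back
```

Because the test conditions on the discordant pairs only, it directly answers the representation-flip question: whether problems lost when switching representations outnumber problems gained.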