Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
arXiv cs.CV / 4/6/2026
Key Points
- The paper reframes geometry education’s visual explanation task as Referring Image Segmentation (RIS), where a model must generate pixel-level masks for described geometric elements in diagrams.
- It argues that existing RIS models break down on geometry schematics due to a major domain shift from real photos to abstract, textureless diagrams.
- To overcome limited training data, the authors build a fully automated procedural data generation engine producing 200,000+ synthetic geometry diagrams with pixel-perfect masks and diverse natural-language referring expressions.
- They propose domain-specific fine-tuning for vision-language models and report that fine-tuned Florence-2 reaches 49% IoU and 85% Buffered IoU, versus under 1% in zero-shot evaluation.
- The work introduces Buffered IoU, a geometry-aware metric designed to better assess thin-structure localization than standard IoU, and positions these results as groundwork for Artificial General Teachers that can provide visually grounded, step-by-step guidance.
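
To illustrate why standard IoU is harsh on thin structures, the sketch below rasterizes a 1-pixel-wide segment, shifts the prediction by one pixel, and compares plain IoU with a tolerance-based variant. This summary does not give the paper's exact Buffered IoU definition, so `buffered_iou`, its `radius` parameter, and the helpers `draw_segment` and `dilate` are hypothetical names with an assumed formulation: a pixel counts as matched if it lies within `radius` pixels of the other mask.

```python
def draw_segment(x0, y0, x1, y1):
    """Rasterize a 1-pixel-wide segment into a set of (x, y) pixels."""
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return {(x0, y0)}
    return {
        (round(x0 + (x1 - x0) * t / steps), round(y0 + (y1 - y0) * t / steps))
        for t in range(steps + 1)
    }


def iou(pred, target):
    """Standard intersection over union on pixel sets."""
    union = pred | target
    return len(pred & target) / len(union) if union else 1.0


def dilate(mask, radius):
    """Grow a mask by `radius` pixels in every direction (square neighborhood)."""
    offsets = range(-radius, radius + 1)
    return {(x + dx, y + dy) for (x, y) in mask for dx in offsets for dy in offsets}


def buffered_iou(pred, target, radius=2):
    """ASSUMED definition, not taken from the paper: a pixel is matched
    when it lies within `radius` pixels of the other mask."""
    union = pred | target
    if not union:
        return 1.0
    matched = (pred & dilate(target, radius)) | (target & dilate(pred, radius))
    return len(matched) / len(union)


# A horizontal segment and a prediction that is off by one pixel vertically.
target = draw_segment(0, 5, 20, 5)
pred = draw_segment(0, 6, 20, 6)
print("IoU:", iou(pred, target))
print("Buffered IoU:", buffered_iou(pred, target, radius=2))
```

Under this assumed formulation, setting `radius=0` reduces the metric to standard IoU, which makes the tolerance the only difference between the two scores.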
