CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement

arXiv cs.CV / 4/14/2026


Key Points

  • The paper investigates a key limitation in educational diagram generation: open-source diffusion models can produce attractive visuals but often garble text labels, while code/LLM-based approaches preserve label correctness but look visually flat.
  • It evaluates three paradigms (diffusion, code/LLM, and closed APIs) on 400 K-12 diagram prompts using both automated and human assessments for label fidelity and visual quality.
  • To address the accuracy–aesthetics gap, the authors propose CAGE (Code-Anchored Generative Enhancement), where an LLM generates executable code for a structurally correct diagram and a diffusion model (via ControlNet conditioning) refines it for visual quality without breaking labels.
  • The work also introduces EduDiagram-2K, a dataset of 2,000 paired programmatic and stylized diagrams designed to support and benchmark the proposed pipeline.
  • The authors present proof-of-concept results and outline a research agenda for the multimedia community.
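
The automated label-fidelity measurement mentioned above could look roughly like the sketch below. The summary does not say how the paper computes it, so everything here is an assumption: the function name `label_fidelity`, the 0.8 similarity threshold, and the premise that some OCR step has already extracted the strings visible in the generated image. It simply reports the fraction of intended labels that have a close match among the recognized strings.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(label.lower().split())


def label_fidelity(expected, recognized, threshold=0.8):
    """Fraction of expected labels matched by a recognized string.

    Hypothetical metric, not the paper's: `recognized` is assumed to come
    from an OCR pass over the generated image, and `threshold` is an
    arbitrary similarity cutoff. Each recognized string is used at most once.
    """
    if not expected:
        return 1.0
    remaining = [normalize(r) for r in recognized]
    matched = 0
    for label in expected:
        target = normalize(label)
        best_idx, best_score = -1, 0.0
        for i, candidate in enumerate(remaining):
            score = SequenceMatcher(None, target, candidate).ratio()
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx >= 0 and best_score >= threshold:
            matched += 1
            remaining.pop(best_idx)  # consume the matched string
    return matched / len(expected)


if __name__ == "__main__":
    wanted = ["Nucleus", "Mitochondria"]
    print(label_fidelity(wanted, ["nucleus", "mitochondria"]))
    print(label_fidelity(wanted, ["nucleus", "xqzt"]))
```

A garbled label of the kind diffusion models tend to produce would fall below the cutoff and lower the score, while a diagram rendered from code would normally reproduce every label exactly.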

Abstract

Educational diagrams -- labeled illustrations of biological processes, chemical structures, physical systems, and mathematical concepts -- are essential cognitive tools in K-12 instruction. Yet no existing method can generate them both accurately and engagingly. Open-source diffusion models produce visually rich images but catastrophically garble text labels. Code-based generation via LLMs guarantees label correctness but yields visually flat outputs. Closed-source APIs partially bridge this gap but remain unreliable and prohibitively expensive at educational scale. We quantify this accuracy-aesthetics dilemma across all three paradigms on 400 K-12 diagram prompts, measuring both label fidelity and visual quality through complementary automated and human evaluation protocols. To resolve it, we propose CAGE (Code-Anchored Generative Enhancement): an LLM synthesizes executable code producing a structurally correct diagram, then a diffusion model conditioned on the programmatic output via ControlNet refines it into a visually polished graphic while preserving label fidelity. We also introduce EduDiagram-2K, a collection of 2,000 paired programmatic-stylized diagrams enabling this pipeline, and present proof-of-concept results and a research agenda for the multimedia community.
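
To make the "code-anchored" stage concrete, here is a minimal sketch of the kind of program an LLM might emit for a labeled diagram. It is an illustration only: the abstract does not specify the target language or drawing library, and the `Part` class, the `render_svg` function, and the SVG output format are all hypothetical. The point is that labels are written as literal text by code, so they cannot be misspelled by the renderer. The second stage (rasterizing this output and using it as the ControlNet conditioning image for a diffusion model) is not shown.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Part:
    """One labeled component of a diagram (hypothetical structure)."""
    label: str
    x: int
    y: int
    radius: int = 40


def render_svg(parts, width=400, height=300) -> str:
    """Render each part as an outlined circle with its label centered on it."""
    shapes = []
    for p in parts:
        shapes.append(
            f'<circle cx="{p.x}" cy="{p.y}" r="{p.radius}" '
            f'fill="none" stroke="black"/>'
        )
        # escape() keeps characters such as & and < valid inside the SVG text
        shapes.append(
            f'<text x="{p.x}" y="{p.y}" text-anchor="middle">'
            f"{escape(p.label)}</text>"
        )
    body = "\n  ".join(shapes)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">\n  {body}\n</svg>'
    )


if __name__ == "__main__":
    cell = [Part("Nucleus", 120, 150), Part("Vacuole", 280, 150, radius=60)]
    print(render_svg(cell))
```

The output is structurally correct but visually plain, which is the "flat" look the abstract attributes to code-based generation and the reason a separate enhancement step is proposed.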