EduIllustrate: Towards Scalable Automated Generation of Multimodal Educational Content

arXiv cs.CL / 4/8/2026


Key Points

  • The paper introduces EduIllustrate, a new benchmark that evaluates LLMs on generating multimodal, diagram-rich educational explanations for K-12 STEM problems beyond traditional Q&A tutoring tasks.
  • EduIllustrate includes 230 problems across five subjects and three grade levels, along with a standardized sequential-anchoring generation protocol that keeps visuals consistent across multiple diagrams (see the sketch after this list).
  • An 8-dimension rubric grounded in multimedia learning theory assesses both text quality and visual quality, enabling more comprehensive measurement of educational content generation.
  • Experiments across ten LLMs reveal a wide performance spread, with Gemini 3.0 Pro Preview leading at 87.8%, and highlight cost-efficiency differences: Kimi-K2.5 reaches 80.8% at $0.12 per problem.
  • Ablation results indicate sequential anchoring improves Visual Consistency by 13% at 94% lower cost, and human studies with expert raters support that LLM-as-judge is reliable for objective criteria but weaker for subjective visual judgments.
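
The summary does not spell out the anchoring mechanics, but the core idea of conditioning each new diagram on a fixed scene description can be sketched roughly as below. This is a minimal, hypothetical illustration: `call_llm`, the prompt wording, and the loop structure are assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a sequential-anchoring generation loop (hypothetical API).
# `call_llm` and the prompt wording are placeholders, not the paper's protocol.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; swap in any chat-completion client."""
    raise NotImplementedError

def generate_diagrams(problem: str, num_diagrams: int) -> list[str]:
    """Generate diagrams one at a time, anchoring each on a shared scene."""
    # Fix a canonical scene description once, then reuse it in every prompt
    # so later diagrams keep the same entities, labels, and layout.
    anchor = call_llm(
        "Read this problem and write a canonical scene description "
        f"(entity names, colors, layout) to reuse in every diagram:\n{problem}"
    )
    diagrams: list[str] = []
    for step in range(num_diagrams):
        prior = "\n".join(diagrams[-1:])  # most recent diagram as extra context
        diagrams.append(call_llm(
            f"Scene anchor (keep all entities visually identical):\n{anchor}\n"
            f"Previous diagram:\n{prior}\n"
            f"Draw diagram {step + 1} for this step of the explanation."
        ))
    return diagrams
```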

Abstract

Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation: the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro Preview leads at 87.8%, while Kimi-K2.5 achieves the best cost-efficiency (80.8% at $0.12/problem). Workflow ablation confirms sequential anchoring improves Visual Consistency by 13% at 94% lower cost. Human evaluation with 20 expert raters validates LLM-as-judge reliability for objective dimensions (ρ ≥ 0.83) while revealing limitations on subjective visual assessment.
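
As a rough illustration of the reliability check described above, one can compute Spearman's ρ between LLM-judge scores and averaged human ratings for each rubric dimension. The sketch below uses SciPy's `spearmanr`; the score values are invented placeholders, not data from the paper.

```python
# Sketch of the reliability check the abstract describes: Spearman's rho
# between LLM-as-judge scores and mean human ratings on one rubric dimension.
# The scores below are made-up; only the method mirrors the paper's setup.
from scipy.stats import spearmanr

llm_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]    # LLM judge, one problem each
human_scores = [4.0, 3.0, 4.5, 2.0, 5.0, 3.0]  # mean of expert raters

rho, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# The paper reports rho >= 0.83 on objective dimensions; a low rho on a
# subjective dimension (e.g., visual appeal) would flag judge unreliability.
```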