StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

arXiv cs.CL / 4/6/2026


Key Points

  • StructEval is proposed as a new benchmark to evaluate how well LLMs generate structured outputs, covering both non-renderable formats (e.g., JSON/YAML/CSV) and renderable ones (e.g., HTML/React/SVG).
  • The benchmark uses two evaluation paradigms—generation from natural-language prompts and conversion between structured formats—and spans 18 formats with 44 task types.
  • The study introduces novel metrics for format adherence and structural correctness to more systematically test “structural fidelity” than prior benchmarks.
  • Experimental results show sizable capability gaps across models, with even top-tier performance (o1-mini at 75.58 average) leaving room for improvement, and open-source models trailing by about 10 points on average.
  • Generation tasks are found to be harder than conversion tasks, and generating correct visual/visualizable content is harder than producing text-only structured outputs.
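The format-adherence metric mentioned above can be approximated, for non-renderable formats, by a simple parse check. The sketch below is illustrative only (stdlib-only, covering JSON and CSV); StructEval's actual metrics are richer and also assess structural correctness and renderable formats.

```python
import csv
import io
import json


def format_adherence(output: str, fmt: str) -> bool:
    """Return True if the model output parses as the target format.

    A parse test is the natural first gate for non-renderable
    formats; it does not check whether the content is correct.
    """
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "csv":
            rows = list(csv.reader(io.StringIO(output)))
            # Require a consistent column count across non-empty rows.
            widths = {len(row) for row in rows if row}
            if len(widths) != 1:
                return False
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except (json.JSONDecodeError, csv.Error):
        return False
```

A check like this only scores *adherence*; judging whether the structure matches the prompt's intent (the harder part the paper measures) requires comparing against a reference schema or rendered result.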

Abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve only a 75.58 average score, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.