The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
arXiv cs.CL / April 29, 2026
Key Points
- The paper introduces SOB (Structured Output Benchmark), a multi-source benchmark for assessing how well large language models generate structured outputs (e.g., JSON) across native text, images, and audio conversations.
- SOB standardizes model inputs via a text-normalized representation across all modalities, isolating structured-output quality from raw vision or speech-processing performance.
- The benchmark includes 5,000 text records from multi-hop QA, 209 image records from OCR-processed PDFs covering challenging document types, and 115 audio records from the AMI corpus; each requires an answer that follows a JSON schema and is grounded in the source context.
- Across 21 frontier and open-weight models, results show near-perfect schema compliance but substantially lower value correctness: exact leaf-value match peaks at 83.0% (text), 67.2% (images), and 23.7% (audio), and degrades further as context length grows (see the scoring sketch after this list).
- The authors release the dataset, evaluation pipeline, and all related code to enable reproducible, source-agnostic structured-output evaluation.
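The compliance-vs-correctness gap implies two separate checks per record: does the output validate against the JSON schema, and do its leaf values exactly match the reference? Below is a minimal, hypothetical sketch of such a scorer in Python using the `jsonschema` library. It is not the authors' released pipeline; all function names here are illustrative.

```python
# Hypothetical sketch of the two metrics described above: schema compliance
# (does the output parse and validate against the schema?) and exact
# leaf-value match (what fraction of reference leaf values are reproduced?).
# NOT the paper's released evaluation code; names are illustrative.
import json
from jsonschema import ValidationError, validate


def flatten_leaves(obj, prefix=""):
    """Map every leaf value in a nested JSON structure to its path."""
    if isinstance(obj, dict):
        leaves = {}
        for key, value in obj.items():
            leaves.update(flatten_leaves(value, f"{prefix}.{key}"))
        return leaves
    if isinstance(obj, list):
        leaves = {}
        for i, value in enumerate(obj):
            leaves.update(flatten_leaves(value, f"{prefix}[{i}]"))
        return leaves
    return {prefix: obj}


def score_output(raw_output: str, schema: dict, reference: dict):
    """Return (schema_compliant, leaf_match_rate) for one model output."""
    try:
        parsed = json.loads(raw_output)
        validate(instance=parsed, schema=schema)
    except (json.JSONDecodeError, ValidationError):
        return False, 0.0
    ref_leaves = flatten_leaves(reference)
    out_leaves = flatten_leaves(parsed)
    matches = sum(out_leaves.get(path) == value
                  for path, value in ref_leaves.items())
    return True, matches / max(len(ref_leaves), 1)
```

Scoring leaves by path rather than comparing whole objects is what lets an output be fully schema-compliant yet score low on value correctness, which is the pattern the paper reports.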