The Structured Output Benchmark (SOB) - validates both JSON parse and value accuracy [R]

Reddit r/MachineLearning / 4/29/2026


Key Points

The Structured Output Benchmark (SOB) highlights that many existing structured output benchmarks focus on JSON validity (schema/types) but miss a more common failure mode: incorrect or hallucinated JSON values.
SOB evaluates structured outputs using seven metrics, including Value Accuracy (exact leaf-value match), Faithfulness (grounded vs hallucinated), and additional structural metrics such as JSON Pass Rate, Type Safety, Path Recall, and Structure Coverage.
Results indicate a notable gap between JSON-schema pass rates (often 90%+) and value accuracy, showing that models can produce valid JSON while still extracting incorrect values.
Open-source models perform strongly in the overall ranking, with GLM 4.7 reportedly placing second just below GPT 5.4, and performance is further analyzed by modality (text, image, audio).
The project provides open-source code and dataset and aims to drive progress for deterministic, controllable structured outputs by benchmarking and holding models and the industry to higher standards.

Current structured output benchmarks only validate the pass rate for JSON schema and types. In practice, the more common issue is inaccurate JSON values.

For example, a hallucinated `total_price` number when extracting values from an invoice, or an array ordered incorrectly because of inaccurate date mapping.
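A minimal sketch of that failure mode, assuming a made-up invoice and ground truth. Only `total_price` comes from the post; the other field names, the values, and the `types_match` helper are hypothetical and not part of SOB. The output parses and every field has the expected type, yet two values are wrong.

```python
import json

# Hypothetical model output for an invoice extraction task.
model_output = (
    '{"invoice_id": "INV-001", "total_price": 1420.50, '
    '"payment_dates": ["2026-03-02", "2026-03-01"]}'
)

# Hypothetical verified ground truth for the same invoice.
ground_truth = {
    "invoice_id": "INV-001",
    "total_price": 1240.50,
    "payment_dates": ["2026-03-01", "2026-03-02"],
}

# Expected Python type per field: a simple stand-in for a JSON schema type check.
expected_types = {"invoice_id": str, "total_price": float, "payment_dates": list}


def types_match(parsed: dict, types: dict) -> bool:
    """Return True when every expected field is present with the expected type."""
    return all(isinstance(parsed.get(key), typ) for key, typ in types.items())


parsed = json.loads(model_output)  # parses without error
structurally_valid = types_match(parsed, expected_types)

# Fields whose values differ from the ground truth, despite passing the type check.
wrong_fields = sorted(k for k in ground_truth if parsed.get(k) != ground_truth[k])

print(structurally_valid)  # True
print(wrong_fields)        # ['payment_dates', 'total_price']
```

A schema-only check would count this response as a pass, which is the gap the post is pointing at.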

The Structured Output Benchmark measures 7 key metrics rather than JSON schema compliance alone.

Value Accuracy (primary): exact leaf-value match against verified ground truth
JSON Pass Rate, Type Safety, Path Recall, Structure Coverage (structural)
Faithfulness: are values grounded in context or hallucinated?
Perfect Response: every single leaf value correct
Modalities: text, image and audio
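To make the leaf-level metrics concrete, here is a rough sketch of how Value Accuracy, Path Recall, and Perfect Response could be computed for a single response. This is an assumption about the general idea, not SOB's actual implementation: the `flatten` and `score` helpers, the path notation, and the choice of ground-truth leaves as the denominator are all illustrative, and the benchmark's own definitions may differ.

```python
from typing import Any


def flatten(value: Any, path: str = "") -> dict[str, Any]:
    """Map each leaf of a nested JSON-like value to a path such as 'items[0].qty'."""
    if isinstance(value, dict):
        leaves: dict[str, Any] = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{path}.{key}" if path else key))
        return leaves
    if isinstance(value, list):
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{path}[{index}]"))
        return leaves
    return {path: value}


def score(predicted: Any, truth: Any) -> dict[str, float]:
    """Score one response against ground truth (assumed metric definitions)."""
    pred_leaves = flatten(predicted)
    true_leaves = flatten(truth)
    if not true_leaves:
        raise ValueError("ground truth has no leaf values")
    # Ground-truth paths that the prediction also contains.
    present = [p for p in true_leaves if p in pred_leaves]
    # Of those, the paths whose value matches exactly.
    correct = [p for p in present if pred_leaves[p] == true_leaves[p]]
    total = len(true_leaves)
    return {
        "value_accuracy": len(correct) / total,
        "path_recall": len(present) / total,
        "perfect_response": float(len(correct) == total),
    }


# Hypothetical example: one wrong value and one missing leaf.
truth = {"vendor": "Acme", "total_price": 1240.5, "items": [{"name": "bolt", "qty": 4}]}
predicted = {"vendor": "Acme", "total_price": 1420.5, "items": [{"name": "bolt"}]}

result = score(predicted, truth)
print(result)
# {'value_accuracy': 0.5, 'path_recall': 0.75, 'perfect_response': 0.0}
```

Note that Python's `==` treats `1` and `1.0` as equal, so a stricter comparison would be needed if type differences should count as value errors.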

Overall results

[Figure: Overall benchmark results]

Open source is doing pretty well, with GLM 4.7 coming in at number 2, right below GPT 5.4.

JSON-pass vs Value-Accuracy gap

[Figure: JSON-pass vs Value-Accuracy gap]

What's interesting here is that while most models hit 90%+ on JSON schema pass rate, all of them drop significantly on value accuracy.

Overall best by modality

[Figure: Overall best by modality]

Full breakdown blog: https://interfaze.ai/blog/introducing-structured-output-benchmark
Full leaderboard: https://interfaze.ai/leaderboards/structured-output-benchmark
Paper: https://interfaze.ai/sob_paper.pdf (Pending arXiv)

The full breakdown goes deeper into the different modalities, how we designed the dataset, and how we performed the benchmark. All code and the dataset are open source 😄

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves and the industry against the best.

submitted by /u/404llm