When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
arXiv cs.CL · May 5, 2026
Key Points
- The paper examines a “structured-output reliability” gap in small (7–9B) language models, where responses must be both mathematically correct and strictly JSON-format compliant.
- Across GSM8K and MATH, naive and reference prompting strategies produce systematic format failures, including cases where JSON validity drops to 0% even though task accuracy remains high.
- Constrained decoding can enforce syntactic JSON validity but adds significant latency (about 3.6×–8.2×) and can substantially reduce task performance.
- The authors introduce AloLab, an iterative system-prompt optimizer that uses a meta-agent (Claude Sonnet 4.5) via black-box API access; it improves JSON output accuracy to 84–87% on GSM8K and 34–40% on MATH without fine-tuning.
- The format reliability problem also appears in GPT-4o: AloLab achieves about 95.2% valid JSON output accuracy, while a reference prompt yields 0% due to markdown-fence wrapping.
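The markdown-fence failure mode described in the last point can be illustrated with a minimal sketch: a strict validator rejects a fenced response outright, while a lenient extractor recovers the embedded JSON. Function names and the fence-stripping heuristic are my own for illustration, not from the paper:

```python
import json

def strict_json_parse(text):
    """Strict check: the raw model output must itself be valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

def lenient_json_parse(text):
    """Lenient check: strip a surrounding markdown code fence, then parse."""
    stripped = text.strip()
    if stripped.startswith("```"):
        lines = stripped.splitlines()
        # Drop the opening fence line (e.g. ```json) and, if present,
        # the closing ``` line.
        if lines and lines[-1].strip() == "```":
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        stripped = "\n".join(lines)
    return strict_json_parse(stripped)

# A typical fenced response: fails strict validation, recoverable leniently.
response = '```json\n{"answer": 42}\n```'
assert strict_json_parse(response) is None
assert lenient_json_parse(response) == {"answer": 42}
```

Under a strict metric like the one the paper appears to use, every such fenced response counts as invalid, which is how a reference prompt can score 0% valid JSON despite producing correct answers.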