When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

arXiv cs.CL / 5/5/2026


Key Points

  • The paper examines a “structured-output reliability” gap in small (7–9B) language models, where responses must be both mathematically correct and strictly JSON-format compliant.
  • Across GSM8K and MATH, naive and reference prompting strategies produce systematic format failures: output accuracy can drop to 0% even when task accuracy reaches 85%, because answers are correct but the JSON contract is violated.
  • Constrained decoding can enforce syntactic JSON validity but adds significant latency (about 3.6×–8.2×) and can substantially reduce task performance.
  • The authors introduce AloLab, an iterative system-prompt optimizer that uses a meta-agent (Claude Sonnet 4.5) via black-box API access; it improves output accuracy — the joint event of a correct answer in valid JSON — to 84–87% on GSM8K and 34–40% on MATH, without fine-tuning the target model.
  • The format reliability problem also appears in GPT-4o: AloLab achieves about 95.2% output accuracy, while the reference prompt yields 0% because the model systematically wraps its JSON in markdown code fences.
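The "output accuracy" metric described above can be sketched as a strict joint check: the raw model output must parse as JSON, carry the required fields, and contain the correct final answer. This is a minimal illustration, not the paper's evaluation code; the field names `reasoning` and `answer` are assumptions.

```python
import json

REQUIRED_FIELDS = {"reasoning", "answer"}  # hypothetical required fields


def output_accurate(raw_output: str, gold_answer: str) -> bool:
    """Joint metric: valid JSON with required fields AND a correct answer."""
    try:
        # Strict parse of the raw string: markdown fences or any extra
        # prose around the JSON payload cause an immediate failure.
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or not REQUIRED_FIELDS <= obj.keys():
        return False
    return str(obj["answer"]).strip() == str(gold_answer).strip()


# A correct answer wrapped in a markdown fence still scores 0 on this metric:
plain = '{"reasoning": "2 + 2 = 4", "answer": "4"}'
fenced = '```json\n{"reasoning": "2 + 2 = 4", "answer": "4"}\n```'
```

Under this definition, high task accuracy and 0% output accuracy can coexist: every answer may be mathematically right while every output violates the format contract.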

Abstract

Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and in several settings degrades task performance substantially. To overcome this limitation, we developed AloLab, an iterative system-prompt optimizer (meta-agent: Claude Sonnet 4.5) requiring only black-box API access to the target model; it reaches 84-87% output accuracy on GSM8K and 34-40% on MATH across five independent runs per model, with 29/30 paired McNemar comparisons against the best static prompt significant at p < 0.05, at near-NAIVE inference latency and without model fine-tuning. The same format failure extends to GPT-4o (OpenAI, 2024), a proprietary closed-source model: REFERENCE achieves 0% output accuracy due to systematic markdown-fence wrapping, while AloLab reaches 95.2% [94.8, 95.6]. An ablation replacing the Sonnet 4.5 meta-agent with Claude 3 Haiku reduces mean output accuracy to 61.0% and increases run-to-run standard deviation from <1 pp to 21.8 pp, confirming that meta-agent capability is a primary driver of optimization quality.
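The markdown-fence failure mode the abstract attributes to GPT-4o under REFERENCE prompting is purely syntactic: the JSON payload is intact but wrapped in a ```` ```json ```` code fence, so a strict parser rejects it. The sketch below, which is illustrative and not part of the paper's pipeline, shows how a lenient post-processor could recover the payload that strict validation counts as a failure.

```python
import json
import re


def strip_markdown_fence(text: str) -> str:
    """Return the inner text of a ```json ... ``` wrapper, if one is present."""
    m = re.match(r"^```(?:json)?\s*\n(.*?)\n```\s*$", text.strip(), re.DOTALL)
    return m.group(1) if m else text


wrapped = '```json\n{"answer": "42"}\n```'

# Strict validation (as in the paper's output-accuracy metric) fails here:
# json.loads(wrapped) raises json.JSONDecodeError.
# After stripping the fence, the same payload parses cleanly:
recovered = json.loads(strip_markdown_fence(wrapped))
```

That the payload is one regex away from validity underlines the paper's point: the gap is a reliability problem in following the output contract, not a capability problem in producing JSON.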