LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

arXiv cs.CL · April 23, 2026


Key Points

  • The paper introduces LLM StructCore, a contract-driven two-stage system for filling Dyspnea clinical Case Report Forms (CRFs) with a strict 134-item output schema.
  • Instead of predicting all fields at once, Stage 1 generates a stable Schema-Guided Reasoning (SGR)-style JSON summary containing exactly nine domain keys.
  • Stage 2 is a deterministic, 0-LLM “compiler” that parses Stage 1 output, canonicalizes item names, normalizes to the official controlled vocabulary, gates evidence to reduce false positives, and expands predictions to the full 134-item format.
  • Experiments for CL4Health 2026 show strong results (dev80 macro-F1 up to 0.6543 EN and 0.6905 IT, with an English Codabench hidden score of 0.63) and comparable, language-agnostic performance across English and Italian.
  • The approach is motivated by the extreme sparsity of known fields and scoring penalties for both empty values and unsupported predictions, emphasizing precision through schema constraints and deterministic post-processing.
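The "exactly nine domain keys" contract from Stage 1 can be enforced mechanically before anything reaches Stage 2. The sketch below is illustrative only: the nine key names are hypothetical placeholders, since the paper's actual domain keys are not listed in this summary.

```python
import json

# Hypothetical domain keys -- the paper's actual nine keys are not given here.
DOMAIN_KEYS = frozenset({
    "history", "symptoms", "vital_signs", "comorbidities", "medications",
    "diagnostics", "imaging", "treatment", "outcome",
})

def validate_stage1(raw: str) -> dict:
    """Parse a Stage 1 LLM summary and enforce the exactly-nine-keys contract."""
    summary = json.loads(raw)
    if set(summary) != DOMAIN_KEYS:
        missing = DOMAIN_KEYS - set(summary)
        extra = set(summary) - DOMAIN_KEYS
        raise ValueError(f"contract violation: missing={missing}, extra={extra}")
    return summary

# A well-formed summary passes; a summary with a stray key is rejected.
ok = validate_stage1(json.dumps({k: [] for k in DOMAIN_KEYS}))
```

Rejecting malformed Stage 1 output up front is what lets Stage 2 stay fully deterministic: the compiler only ever sees JSON that satisfies the schema.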

Abstract

Automatically filling Case Report Forms (CRFs) from clinical notes is challenging due to noisy language, strict output contracts, and the high cost of false positives. We describe our CL4Health 2026 submission for Dyspnea CRF filling (134 items) using a contract-driven two-stage design grounded in Schema-Guided Reasoning (SGR). The key task property is extreme sparsity: the majority of fields are unknown, and official scoring penalizes both empty values and unsupported predictions. We shift from a single-step "LLM predicts 134 fields" approach to a decomposition where (i) Stage 1 produces a stable SGR-style JSON summary with exactly 9 domain keys, and (ii) Stage 2 is a fully deterministic, 0-LLM compiler that parses the Stage 1 summary, canonicalizes item names, normalizes predictions to the official controlled vocabulary, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (EN) and 0.6905 (IT); on the hidden test200, the submitted English variant scores 0.63 on Codabench. The pipeline is language-agnostic: Italian results match or exceed English with no language-specific engineering.
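The Stage 2 steps named in the abstract (canonicalize item names, normalize to the controlled vocabulary, apply evidence-gated false-positive filters, expand to the full form) can be sketched as a small deterministic function. Everything concrete below is an assumption for illustration: the alias table, the controlled vocabulary, and the three-item stand-in for the 134-item schema are hypothetical, not the paper's actual resources.

```python
# Illustrative Stage 2 "compiler": deterministic, zero LLM calls.
CONTROLLED_VOCAB = {"yes", "no", "unknown"}           # placeholder vocabulary
CANONICAL = {"sob": "shortness_of_breath"}            # alias -> canonical item name
ALL_ITEMS = ["shortness_of_breath", "chest_pain", "fever"]  # stands in for 134 items

def compile_predictions(summary: dict) -> dict:
    """Canonicalize, normalize, evidence-gate, and expand to the full item list."""
    gated = {}
    for name, pred in summary.items():
        # 1. Canonicalize the item name.
        canon = CANONICAL.get(name.strip().lower(), name.strip().lower())
        # 2. Normalize the value to the official controlled vocabulary.
        value = pred.get("value", "unknown").strip().lower()
        if value not in CONTROLLED_VOCAB:
            value = "unknown"
        # 3. Evidence gate: drop positive predictions with no supporting span,
        #    since scoring penalizes unsupported predictions.
        if value != "unknown" and not pred.get("evidence"):
            value = "unknown"
        gated[canon] = value
    # 4. Expand: every schema item appears, defaulting to "unknown"
    #    (most fields are unknown under the task's extreme sparsity).
    return {item: gated.get(item, "unknown") for item in ALL_ITEMS}

out = compile_predictions({
    "SOB": {"value": "yes", "evidence": "pt reports dyspnea on exertion"},
    "chest_pain": {"value": "yes", "evidence": ""},  # unsupported -> gated out
})
# out == {"shortness_of_breath": "yes", "chest_pain": "unknown", "fever": "unknown"}
```

Because every step is a pure lookup or filter, the compiler's behavior is reproducible and auditable, which is the point of keeping the LLM out of the second stage.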