From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation

arXiv cs.CL / 4/21/2026


Key Points

  • The study investigates whether intermediate structured representations can help LLMs generate executable legal decision models from legal text, addressing the high cost of manual coding and evaluation in legal informatics.
  • Using a real-world dataset linking Dutch Environment and Planning Act text to production decision models powering the Omgevingsloket platform, the authors compare four input conditions: raw text alone, text enriched with semantic role labels, text enriched with input/output constraints, and text enriched with both.
  • The strongest gains come from adding input/output constraints, improving structural similarity by about 37–54% over the baseline, while semantic role labels yield only modest improvements.
  • On functional (outcome) evaluation, generated models match the gold standard on 51–53% of pre-configured test scenarios, and the generated models tend to be smaller and simpler.
  • Structural similarity and outcome equivalence are found to be complementary: high structural overlap does not necessarily imply correct behavior, and behavioral correctness does not always follow from structural similarity. The authors release the dataset (95 models) and full experimental code for reproducibility.
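The structural-similarity comparison can be sketched with a minimal Weisfeiler-Lehman subtree kernel over two decision-model graphs. This is an illustrative sketch only: the node labels, edge lists, and the specific kernel configuration below are assumptions, not the paper's actual setup.

```python
from collections import Counter

def wl_histograms(nodes, edges, iterations=2):
    """Weisfeiler-Lehman label refinement.

    nodes: {node_id: label}; edges: [(src, dst)] (treated as undirected).
    Returns a Counter over all labels seen across refinement rounds.
    """
    neighbors = {n: [] for n in nodes}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    labels = dict(nodes)
    hist = Counter(labels.values())
    for _ in range(iterations):
        # Each node's new label combines its label with its sorted
        # neighborhood labels, capturing local graph structure.
        labels = {
            n: labels[n] + "|" + "".join(sorted(labels[m] for m in neighbors[n]))
            for n in nodes
        }
        hist.update(labels.values())
    return hist

def wl_similarity(g1, g2, iterations=2):
    """Cosine similarity between the WL label histograms of two graphs."""
    h1 = wl_histograms(*g1, iterations)
    h2 = wl_histograms(*g2, iterations)
    dot = sum(h1[k] * h2[k] for k in h1)
    norm = (sum(v * v for v in h1.values()) * sum(v * v for v in h2.values())) ** 0.5
    return dot / norm if norm else 0.0
```

Identical graphs score 1.0; a generated model that drops or relabels nodes (as the smaller, simpler generated models in the study do) scores strictly below 1.0.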

Abstract

Transforming legal text into executable decision logic is a longstanding challenge in legal informatics. With the rise of LLMs, this task has gained renewed interest, but it remains challenging because it requires extensive manual coding and evaluation. We use a unique real-world dataset that pairs production-grade decision models with legal text from the Dutch Environment and Planning Act. These models power the Omgevingsloket government platform, where citizens check permit requirements for environmental activities. We study whether intermediate structured representations can improve LLM-based generation of executable decision models from legal text. We compare four input conditions: raw legal text, text enriched with semantic role labels, text enriched with input and output constraints, and text enriched with both. We evaluate along two dimensions: structural evaluation, measuring similarity to gold decision models via graph kernels and descriptive graph statistics, and outcome evaluation, measuring functional equivalence by executing models on pre-configured test scenarios. Our findings show that I/O constraints provide the dominant improvement (+37-54% similarity over baseline), while semantic role labels show modest improvements. Outcome evaluation shows that generated models match the gold standard on 51-53% of test scenarios, even though generated models are typically smaller and simpler. We find that LLMs eliminate redundant pass-through logic that comprises up to 45-55% of nodes. Importantly, structural similarity and outcome equivalence are complementary: structural similarity does not guarantee outcome equivalence, and vice versa. To facilitate reproducibility, we publicly release our dataset of 95 production decision models with associated legal text and all experimental code.
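The outcome evaluation described above can be sketched as a match-rate computation: run the gold and the generated decision model on the same pre-configured test scenarios and report the fraction of scenarios where the decisions agree. The model interface (a callable from scenario inputs to a decision) and the toy permit rule are assumptions for illustration, not the paper's implementation.

```python
def outcome_match_rate(gold_model, generated_model, scenarios):
    """Fraction of test scenarios on which both models return the same decision."""
    matches = sum(1 for s in scenarios if gold_model(s) == generated_model(s))
    return matches / len(scenarios) if scenarios else 0.0

# Hypothetical permit-check models that disagree only on the boundary case:
gold = lambda s: "permit required" if s["area_m2"] > 50 else "no permit"
generated = lambda s: "permit required" if s["area_m2"] >= 50 else "no permit"

scenarios = [{"area_m2": a} for a in (10, 50, 80, 120)]
print(outcome_match_rate(gold, generated, scenarios))  # → 0.75
```

A score like the paper's 51-53% would mean the generated model agrees with the production model on roughly half of such scenarios, even when the two graphs differ structurally.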