SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

arXiv cs.CL / 4/9/2026


Key Points

  • The paper argues that strong Text-to-SQL benchmark scores do not guarantee structural reliability of LLM-generated SQL, motivating evaluation beyond execution correctness.
  • It introduces SQLStructEval, which uses canonical AST representations to analyze and compare the program structures of generated SQL queries.
  • Experiments on the Spider benchmark show that modern LLMs can generate structurally diverse SQL for the same question, even when the queries execute to the correct result.
  • The structural variance is often triggered by surface-level changes such as paraphrases or different schema presentation formats.
  • The authors show that generating SQL through a compile-style, structured pipeline can improve both execution accuracy and structural consistency, highlighting structural reliability as an overlooked evaluation dimension.
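The paper's canonical AST comparison requires a full SQL parser, which is beyond a short example. As a hypothetical, much-simplified illustration of the underlying idea (normalize surface form, then compare structure rather than execution results), a toy token-level canonicalizer already distinguishes queries that differ only cosmetically from queries that are logically equivalent but structurally different. This is not the framework's actual method, just a sketch of the distinction it measures:

```python
import re

# Small keyword set for this toy example; a real canonicalizer would
# build a full AST instead of a flat token list.
KEYWORDS = {"select", "from", "where", "and", "or", "group", "order", "by"}

def canonical_tokens(sql):
    # Split into identifiers, numbers, and single-character operators,
    # then uppercase keywords so casing and whitespace differences
    # do not affect comparison.
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[^\s\w]", sql)
    return [t.upper() if t.lower() in KEYWORDS else t for t in tokens]

q1 = "SELECT name FROM users WHERE age > 30"
q2 = "select name\nfrom users where age > 30"       # surface-level change only
q3 = "SELECT name FROM users WHERE 30 < age"        # same result, different structure

assert canonical_tokens(q1) == canonical_tokens(q2)  # structurally identical
assert canonical_tokens(q1) != canonical_tokens(q3)  # structurally divergent
```

The point the paper makes is that execution-based metrics treat q1 and q3 as equally correct, while a structure-level comparison reveals that the model produced two different programs.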

Abstract

Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.