CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems

arXiv cs.CL / 4/27/2026



Key Points

The paper presents “Clarity,” a framework to benchmark interactive NL2SQL systems under realistic ambiguity and unanswerability cases, especially when users provide incomplete clarification.
Clarity automatically generates NL2SQL benchmark data through a constraint-driven pipeline that transforms executable SQL into queries with multi-faceted ambiguities, then augments them with grounded conversational continuations and schema-level metadata.
Experiments on Spider and BIRD show that top NL2SQL systems, including those using strong LLMs, experience substantial performance drops in multi-faceted ambiguity scenarios.
The findings suggest that while current systems can often detect ambiguity, they have difficulty precisely identifying (localizing) and resolving the underlying schema-level causes.
Overall, the work argues for more robust ambiguity detection and resolution capabilities tailored to industry-grade, interactive NL2SQL deployments.

Abstract

NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
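
The abstract separates detecting that a query is ambiguous from localizing the schema-level sources of that ambiguity. The Python sketch below illustrates how those two measurements can diverge. It is not the paper's data format or metric: the record types `AmbiguousExample` and `SystemOutput`, the functions `detection_correct` and `localization_f1`, and the table and column names are all hypothetical, chosen only to make the distinction concrete.

```python
from dataclasses import dataclass, field


@dataclass
class AmbiguousExample:
    """Hypothetical benchmark record (illustrative, not the paper's schema)."""
    question: str
    # Schema elements ("table.column") that make the question ambiguous.
    gold_sources: frozenset[str]
    is_ambiguous: bool = True


@dataclass
class SystemOutput:
    """Hypothetical response of an NL2SQL system to one question."""
    flagged_ambiguous: bool
    predicted_sources: frozenset[str] = field(default_factory=frozenset)


def detection_correct(ex: AmbiguousExample, out: SystemOutput) -> bool:
    """Detection: did the system say 'ambiguous' exactly when it should?"""
    return ex.is_ambiguous == out.flagged_ambiguous


def localization_f1(ex: AmbiguousExample, out: SystemOutput) -> float:
    """Localization: set-level F1 between predicted and gold schema sources."""
    if not ex.gold_sources and not out.predicted_sources:
        return 1.0
    overlap = len(ex.gold_sources & out.predicted_sources)
    if overlap == 0:
        return 0.0
    precision = overlap / len(out.predicted_sources)
    recall = overlap / len(ex.gold_sources)
    return 2 * precision * recall / (precision + recall)


# "amount" could refer to either of two columns (assumed schema).
example = AmbiguousExample(
    question="What is the total amount per region?",
    gold_sources=frozenset({"orders.net_amount", "orders.gross_amount"}),
)

# The system notices the ambiguity but blames one wrong schema element.
output = SystemOutput(
    flagged_ambiguous=True,
    predicted_sources=frozenset({"orders.gross_amount", "customers.region"}),
)

detected = detection_correct(example, output)
f1 = localization_f1(example, output)
print(detected, f1)
```

In this toy case the system gets full credit for detection while recovering only half of the schema-level sources, which is the kind of gap the abstract reports between detecting ambiguity and localizing it.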