CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems

arXiv cs.CL / 4/27/2026

Key Points

  • The paper presents “Clarity,” a framework to benchmark interactive NL2SQL systems under realistic ambiguity and unanswerability cases, especially when users provide incomplete clarification.
  • Clarity automatically generates NL2SQL benchmark data through a constraint-driven pipeline that transforms executable SQL into queries with multi-faceted ambiguities, then augments them with grounded conversational continuations and schema-level metadata.
  • Experiments on Spider and BIRD show that top NL2SQL systems, including those using strong LLMs, experience substantial performance drops in multi-faceted ambiguity scenarios.
  • The findings suggest that while current systems can often detect ambiguity, they have difficulty precisely identifying (localizing) and resolving the underlying schema-level causes.
  • Overall, the work argues for more robust ambiguity detection and resolution capabilities tailored to industry-grade, interactive NL2SQL deployments.

Abstract

NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
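
To make the distinction between detecting ambiguity and localizing its schema-level source concrete, here is a minimal Python sketch. It is not the Clarity pipeline or any evaluated system: the toy schema, the `Diagnosis` type, and the `localize` function are all hypothetical names invented for illustration, and exact column-name matching stands in for whatever linking a real NL2SQL system would use.

```python
from dataclasses import dataclass

# Hypothetical toy schema (table name -> column names); not taken from the paper.
SCHEMA = {
    "singer": ["singer_id", "name", "country"],
    "song": ["song_id", "name", "singer_id", "release_year"],
}


@dataclass
class Diagnosis:
    status: str       # "answerable", "ambiguous", or "unanswerable"
    candidates: list  # fully qualified columns the mention could refer to


def localize(mention: str, schema: dict) -> Diagnosis:
    """Map a mention from a user question to candidate schema columns.

    Assumption: exact, case-insensitive column-name matching. A real system
    would use a learned or LLM-based schema linker instead.
    """
    term = mention.lower()
    candidates = [
        f"{table}.{column}"
        for table, columns in schema.items()
        for column in columns
        if column.lower() == term
    ]
    if not candidates:
        # Nothing in the schema can ground the mention.
        return Diagnosis("unanswerable", [])
    if len(candidates) == 1:
        return Diagnosis("answerable", candidates)
    # Several columns match: report which ones, not only that ambiguity exists.
    return Diagnosis("ambiguous", candidates)
```

In this sketch, `localize("name", SCHEMA)` flags the mention as ambiguous and lists both `singer.name` and `song.name`, while `localize("salary", SCHEMA)` reports it as unanswerable. Returning the candidate columns rather than a bare flag mirrors the gap the paper reports: noticing that a question is ambiguous is easier than naming the schema elements responsible.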