Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems

arXiv cs.AI / 5/1/2026


Key Points

  • Existing text-to-SQL (T2SQL) evaluation methods often assume access to ground-truth SQL queries and structured database schemas, which are uncommon in real production deployments.
  • The paper introduces STEF, a schema-agnostic, production-native evaluation framework that scores SQL accuracy using only natural-language inputs (question and reformulation) plus the generated SQL, without needing schemas or reference queries.
  • STEF extracts semantic specifications from both the natural language and SQL, aligns normalized features, and outputs an interpretable 0–100 score based on a composite metric covering filter alignment, semantic verdict, and evaluator confidence.
  • The framework also adds production-friendly capabilities such as enriched question quality validation, configurable app-specific rule injection via prompt templating, and robust normalization heuristics for GROUP BY / ORDER BY / LIMIT behaviors.
  • The authors report that STEF enables continuous production monitoring and feedback loops for T2SQL agents at scale, making structured query evaluation feasible without schema dependency.
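The composite scoring described above can be sketched as a weighted combination of the three signals. The paper does not publish its exact weights or formula, so the function name, weights, and rounding below are assumptions for illustration only:

```python
# Hypothetical sketch of a STEF-style composite accuracy score.
# The weights are illustrative assumptions, not values from the paper.

def composite_score(filter_alignment: float,
                    semantic_verdict: float,
                    evaluator_confidence: float,
                    weights=(0.5, 0.35, 0.15)) -> float:
    """Combine three signals (each in [0, 1]) into a 0-100 accuracy score."""
    w_f, w_v, w_c = weights
    raw = (w_f * filter_alignment
           + w_v * semantic_verdict
           + w_c * evaluator_confidence)
    return round(100 * raw, 1)

# Strong filter alignment, positive semantic verdict, moderate confidence:
print(composite_score(0.9, 1.0, 0.7))  # → 90.5
```

Keeping the aggregation this simple is what makes the 0–100 score interpretable: each sub-signal's contribution can be read off directly when diagnosing a low-scoring query.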

Abstract

Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies, whether rule-based SQL matching or schema-dependent semantic parsing, assume access to ground-truth queries and structured database schemas, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond developer-time testing, creating silent quality degradation with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural-language inputs (the user question, an enriched reformulation, and the generated SQL) without requiring a database schema or reference queries. STEF extracts semantic specifications from both the natural-language and SQL representations, performs normalized feature alignment, and produces an interpretable 0–100 accuracy score via a composite metric that encompasses filter alignment, semantic verdict, and evaluator confidence. Key contributions include: enriched question quality validation as a first-class evaluation signal, configurable application-specific rule injection via prompt templating, and production-robust normalization handling GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. Empirical results demonstrate that STEF enables continuous production monitoring and agent-improvement feedback loops without schema dependency, making structured query evaluation viable at scale.
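The normalization step the abstract mentions (GROUP BY tolerance, ORDER BY defaults, LIMIT heuristics) might look like the following minimal sketch, which canonicalizes those clauses before feature alignment. The function name, the regex-based extraction, and the ASC-by-default rule are assumptions for illustration; a production implementation would use a real SQL parser:

```python
import re

# Illustrative sketch: canonicalize GROUP BY / ORDER BY / LIMIT so that two
# SQL strings can be compared clause-by-clause. Regexes stand in for a parser.

def normalize_clauses(sql: str) -> dict:
    """Extract GROUP BY / ORDER BY / LIMIT into a canonical spec."""
    sql = " ".join(sql.split())  # collapse whitespace
    spec = {"group_by": [], "order_by": [], "limit": None}

    m = re.search(r"GROUP BY (.+?)(?: ORDER BY| LIMIT|$)", sql, re.I)
    if m:
        # Sort columns: GROUP BY order is irrelevant to the result set.
        spec["group_by"] = sorted(c.strip().lower() for c in m.group(1).split(","))

    m = re.search(r"ORDER BY (.+?)(?: LIMIT|$)", sql, re.I)
    if m:
        for col in m.group(1).split(","):
            parts = col.strip().split()
            # SQL's default sort direction is ASC when unspecified.
            direction = parts[1].upper() if len(parts) > 1 else "ASC"
            spec["order_by"].append((parts[0].lower(), direction))

    m = re.search(r"LIMIT (\d+)", sql, re.I)
    if m:
        spec["limit"] = int(m.group(1))
    return spec

print(normalize_clauses(
    "SELECT region, SUM(sales) FROM t "
    "GROUP BY region ORDER BY SUM(sales) DESC LIMIT 5"))
```

With both the generated SQL and the specification extracted from the natural-language question reduced to this canonical form, alignment becomes a per-field comparison rather than brittle string matching.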