ReViSQL: Achieving Human-Level Text-to-SQL

arXiv cs.CL / 3/23/2026

Key Points

  • ReViSQL introduces a streamlined framework that achieves human-level accuracy on the BIRD Text-to-SQL benchmark without requiring more complex AI architectures.
  • It relies on reinforcement learning with verifiable rewards (RLVR) applied to BIRD-Verified, a 2.5k-instance dataset curated and corrected by SQL experts, with a data-cleaning workflow that fixed errors in 61.1% of a subset of the BIRD training set.
  • The authors show that improving data quality alone yields an 8.2–13.9% gain in single-generation accuracy under the same RLVR setup.
  • Inference-time scaling via execution-based reconciliation and majority voting further boosts accuracy and reliability.
  • On expert-verified BIRD Mini-Dev, ReViSQL-235B-A22B reaches 93.2% execution accuracy, surpassing proxy human-level accuracy (92.96%) and beating prior open-source SOTA by 9.8%, while the smaller ReViSQL-30B-A3B matches SOTA at 7.5x lower per-query cost.
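
The summary above does not spell out how execution-based reconciliation and majority voting are implemented, so the following is only a minimal sketch of one common reading: sample several candidate queries, run each against the database, and keep a query whose result set is the most frequent. The helper names (execute, majority_vote), the toy orders table, and the candidate queries are all hypothetical and are not taken from the paper.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows containing NULLs or mixed types still compare safely.
    return tuple(sorted(rows, key=repr))


def majority_vote(conn, candidates):
    """Return the first candidate whose execution result is the most common one."""
    executed = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, res) for sql, res in executed if res is not None]
    if not valid:
        return None  # every candidate failed to execute
    counts = Counter(res for _, res in valid)
    winning_result, _ = counts.most_common(1)[0]
    return next(sql for sql, res in valid if res == winning_result)


# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 25.0), (3, 40.0);
""")

# Hypothetical samples for "How many orders are above 20?"
candidates = [
    "SELECT COUNT(*) FROM orders WHERE amount > 20",
    "SELECT COUNT(id) FROM orders WHERE amount > 20.0",
    "SELECT COUNT(*) FROM orders WHERE amount >= 10",   # different answer
    "SELECT COUNT(*) FROM order WHERE amount > 20",     # fails: reserved word
]

best = majority_vote(conn, candidates)
print(best)
```

Voting on execution results rather than on query text lets syntactically different but equivalent queries (the first two candidates here) reinforce each other, while candidates that fail to run are simply dropped from the vote.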

Abstract

Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data to improve the SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL leverages reinforcement learning with verifiable rewards (RLVR) on BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts. We identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts the single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework with two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5x lower per-query cost.
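
To make the link between data quality and RLVR concrete, here is a minimal sketch of an execution-match reward, assuming the verifiable reward is a binary check that the predicted query returns the same rows as the reference query. The abstract does not give the exact reward definition, so the function execution_reward, the order-insensitive comparison, the students table, and the "noisy" reference query are all illustrative assumptions.

```python
import sqlite3


def run(conn, sql):
    """Execute a query; return its rows in a canonical order, or None on error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_reward(conn, predicted_sql, reference_sql):
    """Assumed binary reward: 1.0 if both queries return the same rows, else 0.0."""
    reference = run(conn, reference_sql)
    predicted = run(conn, predicted_sql)
    if reference is None or predicted is None:
        return 0.0
    return 1.0 if predicted == reference else 0.0


# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (name TEXT, score INTEGER);
INSERT INTO students VALUES ('Ana', 91), ('Ben', 78), ('Cy', 85);
""")

# Question: "Which students scored above 80?"
gold_sql = "SELECT name FROM students WHERE score > 80"
good_sql = "SELECT name FROM students WHERE score >= 81 ORDER BY name"
bad_sql = "SELECT name FROM students WHERE score > 90"

# Hypothetical annotation error: the reference uses the wrong threshold.
noisy_gold_sql = "SELECT name FROM students WHERE score > 90"

print(execution_reward(conn, good_sql, gold_sql))        # 1.0
print(execution_reward(conn, bad_sql, gold_sql))         # 0.0
print(execution_reward(conn, good_sql, noisy_gold_sql))  # 0.0
```

The last call shows why reference errors matter under this kind of reward: a correct prediction scored against a wrong reference receives zero reward, so the training signal penalizes the right answer. This is consistent with the abstract's claim that cleaning the training data alone improves accuracy under the same RLVR algorithm.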