SQLBench: A Comprehensive Evaluation for Text-to-SQL Capabilities of Large Language Models

arXiv cs.CL / 3/20/2026


Key Points

  • The paper introduces a new dataset designed to mitigate overfitting for Text-to-SQL with LLMs.
  • It formulates five evaluation tasks to comprehensively assess performance across the stages of the Text-to-SQL pipeline for multiple LLMs.
  • The study highlights notable performance disparities among LLMs and derives optimal in-context learning solutions tailored to each task.
  • The findings provide practical insights to facilitate the development of LLM-based Text-to-SQL systems.

Abstract

Large Language Models (LLMs) have emerged as a powerful tool for advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on optimal prompt templates and design frameworks. Additionally, existing benchmarks inadequately explore the performance of LLMs across the various sub-tasks of the Text-to-SQL process, which hinders the assessment of LLMs' cognitive capabilities and the optimization of LLM-based solutions. To address these issues, we first construct a new dataset designed to mitigate the risk of overfitting in LLMs. We then formulate five evaluation tasks to comprehensively assess the performance of diverse methods across various LLMs throughout the Text-to-SQL process. Our study highlights the performance disparities among LLMs and proposes optimal in-context learning solutions tailored to each task. These findings offer valuable insights for facilitating the development of LLM-based Text-to-SQL systems.
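To make the "in-context learning solutions" discussed above concrete, here is a minimal sketch of how a few-shot Text-to-SQL prompt is typically assembled: the database schema, a handful of question/SQL demonstrations, and the new question. The schema, demonstration pairs, and helper name below are hypothetical illustrations, not the prompt templates evaluated in the paper.

```python
# Minimal sketch of a few-shot (in-context learning) prompt for Text-to-SQL.
# The schema, examples, and function name are illustrative assumptions,
# not taken from the SQLBench paper.

SCHEMA = "CREATE TABLE employees (id INT, name TEXT, dept TEXT, salary INT);"

# Demonstration pairs shown to the model before the real question.
EXAMPLES = [
    ("List all employee names.", "SELECT name FROM employees;"),
    ("How many departments are there?",
     "SELECT COUNT(DISTINCT dept) FROM employees;"),
]

def build_prompt(question: str) -> str:
    """Assemble a few-shot prompt: schema, demonstrations, then the new question."""
    parts = ["-- Database schema:", SCHEMA, ""]
    for q, sql in EXAMPLES:
        parts += [f"-- Question: {q}", sql, ""]
    # End with the target question and a "SELECT" cue so the model
    # completes the SQL rather than answering in prose.
    parts += [f"-- Question: {question}", "SELECT"]
    return "\n".join(parts)

print(build_prompt("What is the average salary?"))
```

The prompt string would then be sent to an LLM for completion; varying the number and choice of demonstrations is exactly the kind of design decision the paper's per-task evaluation is meant to inform.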