Optimizing Small Language Models for NL2SQL via Chain-of-Thought Fine-Tuning

arXiv cs.AI / 3/25/2026


Key Points

  • The paper studies how fine-tuning can improve NL2SQL systems, aiming to make SQL generation usable at enterprise scale despite high inference costs of large LLMs.
  • It finds a counter-intuitive scaling result: fine-tuning large models on standard NL2SQL datasets provides negligible benefits and can even cause overfitting on complex queries.
  • In contrast, fine-tuning small models (e.g., Qwen) yields substantial gains, raising accuracy from a 36% baseline to 45%.
  • Adding explicit Chain-of-Thought (CoT) reasoning into the training data further boosts accuracy to 54.5%, improving reasoning transfer from larger systems to smaller, cheaper models.
  • The authors conclude that small, compute-efficient models can reach production-relevant performance targets by learning reasoning patterns, enabling lower cost and latency deployments even if large-model accuracy remains higher.
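
To make the CoT-enrichment step concrete, here is a minimal sketch of how a single NL2SQL fine-tuning record might be augmented with an explicit reasoning trace before the target SQL. The chat-message JSONL shape and all field names are illustrative assumptions (a common SFT convention), not the paper's actual data format.

```python
import json

def make_cot_record(question: str, schema: str, reasoning: str, sql: str) -> dict:
    """Build a chat-style fine-tuning example whose target output contains
    the reasoning steps followed by the final SQL query, so the small model
    learns the reasoning pattern alongside the answer.

    This structure is a hypothetical sketch; the paper's dataset format
    may differ.
    """
    return {
        "messages": [
            {"role": "system",
             "content": "Translate the question into SQL for the given schema."},
            {"role": "user",
             "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
            {"role": "assistant",
             # Target completion: reasoning first, then the SQL.
             "content": f"Reasoning: {reasoning}\n\nSQL:\n{sql}"},
        ]
    }

record = make_cot_record(
    question="How many orders were placed in 2024?",
    schema="orders(id INT, placed_at DATE, total NUMERIC)",
    reasoning=("The question asks for a count of rows in orders restricted "
               "to the year 2024, so filter on placed_at and use COUNT(*)."),
    sql="SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM placed_at) = 2024;",
)

# One JSONL line per training example.
print(json.dumps(record))
```

The design intuition is that the assistant turn carries both the rationale and the SQL, so supervised fine-tuning teaches the small model to reproduce the reasoning pattern (possibly distilled from a larger model) rather than memorizing query surface forms.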

Abstract

Translating Natural Language to SQL (NL2SQL) remains a critical bottleneck for the democratization of data in enterprises. Although Large Language Models (LLMs) such as Gemini 2.5 have demonstrated impressive zero-shot capabilities, their high inference costs limit deployment at scale. This paper explores the efficacy of fine-tuning both large and small language models on NL2SQL tasks. Our research reveals a counter-intuitive scaling phenomenon: fine-tuning large models (Gemini 2.5 Flash/Lite) on standard datasets yields negligible returns, often leading to overfitting on complex queries. Conversely, small models (Qwen) show significant gains. Fine-tuning improved the small-model baseline from 36% to 45%, and further enriching the dataset with explicit Chain-of-Thought (CoT) reasoning surged accuracy to 54.5% (Fig 2). While this is still lower than the accuracy of large models like Gemini 2.5, it serves the business goals of significant cost reduction, lower inference latency, and meeting the business-critical accuracy threshold. This paper demonstrates that transferring reasoning patterns enables compute-efficient smaller models to approach production-grade performance.