Optimizing Small Language Models for NL2SQL via Chain-of-Thought Fine-Tuning

arXiv cs.AI / 3/25/2026


Key Points

  • The paper studies how fine-tuning can improve NL2SQL systems, aiming to make SQL generation usable at enterprise scale despite high inference costs of large LLMs.
  • It finds a counter-intuitive scaling result: fine-tuning large models on standard NL2SQL datasets provides negligible benefits and can even cause overfitting on complex queries.
  • In contrast, fine-tuning small models (e.g., Qwen) yields substantial gains, raising accuracy from a 36% baseline to 45%.
  • Adding explicit Chain-of-Thought (CoT) reasoning into the training data further boosts accuracy to 54.5%, improving reasoning transfer from larger systems to smaller, cheaper models.
  • The authors conclude that small, compute-efficient models can reach production-relevant performance targets by learning reasoning patterns, enabling lower cost and latency deployments even if large-model accuracy remains higher.
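
To make the CoT-enrichment step concrete, here is a minimal sketch of how a single NL2SQL fine-tuning record might be augmented with an explicit reasoning trace before the target SQL. The chat-message JSONL shape and all field names are illustrative assumptions (a common SFT convention), not the paper's actual data format.

```python
import json

def make_cot_record(question: str, schema: str, reasoning: str, sql: str) -> dict:
    """Build a chat-style fine-tuning example whose target output contains
    the reasoning steps followed by the final SQL query, so the small model
    learns the reasoning pattern alongside the answer.

    This structure is a hypothetical sketch; the paper's dataset format
    may differ.
    """
    return {
        "messages": [
            {"role": "system",
             "content": "Translate the question into SQL for the given schema."},
            {"role": "user",
             "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
            {"role": "assistant",
             # Target completion: reasoning first, then the SQL.
             "content": f"Reasoning: {reasoning}\n\nSQL:\n{sql}"},
        ]
    }

record = make_cot_record(
    question="How many orders were placed in 2024?",
    schema="orders(id INT, placed_at DATE, total NUMERIC)",
    reasoning=("The question asks for a count of rows in orders restricted "
               "to the year 2024, so filter on placed_at and use COUNT(*)."),
    sql="SELECT COUNT(*) FROM orders WHERE EXTRACT(YEAR FROM placed_at) = 2024;",
)

# One JSONL line per training example.
print(json.dumps(record))
```

The design intuition is that the assistant turn carries both the rationale and the SQL, so supervised fine-tuning teaches the small model to reproduce the reasoning pattern (possibly distilled from a larger model) rather than memorizing query surface forms.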

Abstract

Translating Natural Language to SQL (NL2SQL) remains a critical bottleneck for the democratization of data in enterprises. Although Large Language Models (LLMs) such as Gemini 2.5 have demonstrated impressive zero-shot capabilities, their high inference costs limit deployment at scale. This paper explores the efficacy of fine-tuning both large and small language models on NL2SQL tasks. Our research reveals a counter-intuitive scaling phenomenon: fine-tuning large models (Gemini 2.5 Flash/Lite) on standard datasets yields negligible returns, often leading to overfitting on complex queries. Conversely, small models (Qwen) show significant gains. Fine-tuning improved the small-model baseline from 36% to 45%, and further enriching the dataset with explicit Chain-of-Thought (CoT) reasoning surged accuracy to 54.5% (Fig 2). While this is still lower than the accuracy of large models like Gemini 2.5, it serves the business goals of significant cost reduction, lower inference latency, and meeting the business-critical accuracy threshold. This paper demonstrates that transferring reasoning patterns enables compute-efficient smaller models to approach production-grade performance.