Last week I asked for some feedback about what extra models I should test. I've added them all and now the benchmark is available at https://sql-benchmark.nicklothian.com/.

I didn't say a lot about the agent at the time, but in simple terms it takes an English query like "Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory" and turns it into SQL that it tests against a set of database tables. It gets to see the query results and can modify the query to fix issues, but there is a limit on the number of debugging rounds it gets.

The benchmark is deliberately short (25 questions) and fast to run (much less than 5 minutes for most models) so you can try different configurations etc., but it is tough enough to separate the best models from the others.

I added the ability to run it yourself against your own server (thanks to the WASM version of llama.cpp).

A few of the things I found interesting:

I'd love to see some scores people get, as well as what I should change for v2!
I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued...
Reddit r/LocalLLaMA / 3/30/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A developer published an agentic text-to-SQL benchmark at https://sql-benchmark.nicklothian.com/ that converts English questions into SQL, runs the queries, and uses limited debugging rounds to correct errors.
- The benchmark is designed to be short (25 questions) and fast (under 5 minutes for most models) to enable practical comparisons across different LLM configurations.
- The top open models in the results were kimi-k2.5, Qwen 3.5 397B-A17B, and Qwen 3.5 27B; NVIDIA Nemotron-Cascade-2-30B-A3B also performed very strongly, matching Codex 5.3 in the tests.
- The author added a way to run the benchmark against a user’s own server using a WASM version of llama.cpp, lowering the barrier for local evaluation.
- The post invites the community to share scores and feedback for improving a potential v2 of the benchmark.
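
The generate, run, and repair loop described in the first key point can be sketched roughly as follows. This is only an illustration under stated assumptions, not the benchmark's actual code: the `order_lines` table, the `run_agent` and `fake_model` names, the round limit of 3, and the rule "accept the first query that runs and returns rows" are all hypothetical, and a scripted stub stands in for a real LLM call.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: (question, feedback from the last round) -> SQL text.
ModelFn = Callable[[str, Optional[str]], str]


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (made-up schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE order_lines (subcategory TEXT, revenue REAL, units INTEGER)"
    )
    conn.executemany(
        "INSERT INTO order_lines VALUES (?, ?, ?)",
        [("Bikes", 100.0, 4), ("Bikes", 50.0, 1), ("Helmets", 30.0, 3)],
    )
    return conn


def run_agent(
    question: str,
    conn: sqlite3.Connection,
    ask_model: ModelFn,
    max_rounds: int = 3,  # assumed limit; the post does not state the real one
) -> Tuple[Optional[str], Optional[List[tuple]]]:
    """Ask for SQL, run it, and feed errors back until the rounds run out."""
    feedback: Optional[str] = None
    sql: Optional[str] = None
    for _ in range(max_rounds):
        sql = ask_model(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # The model sees the database error on its next attempt.
            feedback = f"SQL error: {exc}"
            continue
        if rows:
            # Simplifying assumption: the first non-empty result is accepted.
            return sql, rows
        feedback = "Query ran but returned no rows"
    return sql, None


def fake_model(question: str, feedback: Optional[str]) -> str:
    """Scripted stand-in for an LLM: wrong column first, then a fix."""
    if feedback is None:
        # Deliberate mistake: the column is called `units`, not `unit`.
        return (
            "SELECT subcategory, SUM(revenue) / SUM(unit) AS revenue_per_unit "
            "FROM order_lines GROUP BY subcategory"
        )
    # Revenue per unit = total revenue / total units sold, per subcategory.
    return (
        "SELECT subcategory, SUM(revenue) * 1.0 / SUM(units) AS revenue_per_unit "
        "FROM order_lines GROUP BY subcategory ORDER BY subcategory"
    )


demo_conn = make_demo_db()
final_sql, final_rows = run_agent(
    "Show revenue per unit for each product subcategory", demo_conn, fake_model
)
print(final_sql)
print(final_rows)
```

In this sketch the first attempt fails with a "no such column" error, the error text is passed back as feedback, and the second attempt succeeds. With `max_rounds=1` the same stub would run out of rounds and return no rows, which is how a round limit can separate models that repair quickly from those that do not.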