I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued...

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A developer published an agentic text-to-SQL benchmark at https://sql-benchmark.nicklothian.com/ that converts English questions into SQL, runs the queries, and uses limited debugging rounds to correct errors.
  • The benchmark is designed to be short (25 questions) and fast (under 5 minutes for most models) to enable practical comparisons across different LLM configurations.
  • Top open models in the results include kimi-k2.5, Qwen 3.5 397B-A17B, and Qwen 3.5 27B; NVIDIA Nemotron-Cascade-2-30B-A3B also performed very strongly, matching Codex 5.3.
  • The author added a way to run the benchmark against a user’s own server using a WASM version of llama.cpp, lowering the barrier for local evaluation.
  • The post invites the community to share scores and feedback for improving a potential v2 of the benchmark.

Last week I asked for some feedback about what extra models I should test. I've added them all and now the benchmark is available at https://sql-benchmark.nicklothian.com/

I didn't say a lot about how the agent works at the time, but in simple terms it takes an English query like "Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory" and turns it into SQL, which it tests against a set of database tables.
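To give a sense of what a model has to land on for a question like that, here's a rough sketch in Python/SQLite. The two-table schema (`products`, `order_lines`) and all the data are made up for illustration — they are not the benchmark's actual tables:

```python
import sqlite3

# Hypothetical schema standing in for the benchmark's tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, subcategory TEXT, list_price REAL);
    CREATE TABLE order_lines (product_id INTEGER, units INTEGER, revenue REAL, cost REAL);
    INSERT INTO products VALUES (1, 'Helmets', 35.0), (2, 'Helmets', 40.0), (3, 'Tires', 25.0);
    INSERT INTO order_lines VALUES (1, 10, 300.0, 200.0), (2, 5, 180.0, 100.0), (3, 8, 160.0, 120.0);
""")

# One plausible answer to the subcategory question. Note the easy trap:
# AVG(list_price) over the joined rows would weight products by how many
# order lines they have; "per product in the subcategory" needs the
# correlated subquery below (or AVG over a products-only derived table).
sql = """
    SELECT p.subcategory,
           COUNT(*)                                   AS order_lines,
           SUM(o.revenue)                             AS revenue,
           SUM(o.units)                               AS units_sold,
           SUM(o.revenue) * 1.0 / SUM(o.units)        AS revenue_per_unit,
           (SELECT AVG(list_price) FROM products p2
            WHERE p2.subcategory = p.subcategory)     AS avg_list_price,
           SUM(o.revenue) - SUM(o.cost)               AS gross_profit,
           (SUM(o.revenue) - SUM(o.cost)) * 100.0
               / SUM(o.revenue)                       AS margin_pct
    FROM order_lines o
    JOIN products p ON p.id = o.product_id
    GROUP BY p.subcategory
    ORDER BY p.subcategory
"""
for row in conn.execute(sql):
    print(row)
```

Multi-aggregate questions like this are where weaker models slip — they tend to get one of the derived metrics (revenue per unit, margin %) subtly wrong rather than erroring outright.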

It gets to see the query results and can modify the SQL to fix issues, but it's limited in the number of debugging rounds it gets.
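The loop described above is roughly this shape — a minimal sketch, not the benchmark's actual implementation. `generate_sql` stands in for the model call, the round limit of 3 is an assumption (the post only says rounds are capped), and the stub treats any query that executes as a success, whereas the real agent also inspects the returned rows:

```python
import sqlite3

MAX_DEBUG_ROUNDS = 3  # assumed cap; the post doesn't state the actual limit

def generate_sql(question, schema, feedback):
    """Stand-in for the LLM call. A real agent would prompt the model with
    the question, the schema, and any error/result feedback so far."""
    if feedback is None:
        return "SELECT nme FROM products"  # deliberately broken first attempt
    return "SELECT name FROM products ORDER BY name"

def run_agent(conn, question, schema):
    feedback, sql = None, ""
    for _ in range(1 + MAX_DEBUG_ROUNDS):  # first attempt + debugging rounds
        sql = generate_sql(question, schema, feedback)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows  # agent sees the results and stops here
        except sqlite3.Error as e:
            feedback = f"query failed: {e}"  # fed back into the next round
    return sql, None  # ran out of debugging rounds

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)", [("widget",), ("gadget",)])
sql, rows = run_agent(conn, "List product names alphabetically", "products(name)")
```

Capping the rounds matters for a benchmark: it keeps runtimes bounded and stops a weak model from brute-forcing its way to a passing query.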

The benchmark is deliberately short (25 questions) and fast to run (well under 5 minutes for most models) so you can try different configurations, but it's tough enough to separate the best models from the rest.

I added the ability to run it yourself against your own server (thanks to the WASM version of llama.cpp).

A few of the things I found interesting:

  • The best open models are kimi-k2.5, Qwen 3.5 397B-A17B and Qwen 3.5 27B (!)
  • NVIDIA Nemotron-Cascade-2-30B-A3B outscores Qwen 3.5-35B-A3B and matches Codex 5.3
  • Mimo v2 Flash is a gem of a model

I'd love to see some scores people get, as well as what I should change for v2!

submitted by /u/nickl