AI Navigate

I fine-tuned a 14B model that outperforms Claude Opus 4.6 on Ada code generation

Reddit r/LocalLLaMA / 3/14/2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • A 14B model named Steelman R5 was fine-tuned with QLoRA on a compiler-verified Ada/SPARK dataset (3,430 instruction pairs) and achieves the first published Ada pass@1 results on HumanEval for open models.
  • In a custom Ada Compilation Benchmark, Steelman R5 reached a 68.6% compile rate, outperforming Claude Opus 4.6 (42.1%) and Claude Sonnet 4.6 (37.2%).
  • On MultiPL-E HumanEval-Ada (157 problems, pass@1), Steelman R5 achieved 47.1% pass@1 with 74.5% compile rate, higher than the base Qwen2.5-Coder-14B (34.4% pass@1, 51.0% compile rate).
  • Training details include 4-bit QLoRA with Unsloth and TRL SFTTrainer, LoRA rank 32, five rounds (R2 discarded due to catastrophic forgetting), 1 epoch per round, lr 2e-5, ~49 minutes per round on an H100; the model runs in 12GB VRAM with Q4_K_M and can be tried via the provided Ollama command.

Ada is the language behind flight controllers, missile guidance, satellite systems, and air traffic control. It's one of the most important languages in safety-critical software, and every major LLM I tested is subpar at it.

I fine-tuned Qwen2.5-Coder-14B-Instruct using QLoRA on a compiler-verified dataset of 3,430 Ada/SPARK instruction pairs. Every single training example passes gnatmake -gnat2022 -gnatwa. The model never trains on broken code.
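The compile-gating step described above can be sketched roughly as follows. This is a minimal sketch, not the author's pipeline: the function names, the single-file layout, and the dataset shape are assumptions; only the gnatmake flags come from the post.

```python
import subprocess
import tempfile
from pathlib import Path

def compiles_cleanly(ada_source: str, unit_name: str = "main") -> bool:
    """Return True if the Ada source builds with the post's gate:
    gnatmake -gnat2022 -gnatwa (Ada 2022 mode, all warnings enabled).
    Requires GNAT on PATH. Note: with -gnatwa alone, warnings do not
    fail the build unless -gnatwe is also passed; this checks errors.
    Single-file unit naming here is a simplification of real Ada rules.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"{unit_name}.adb"
        src.write_text(ada_source)
        result = subprocess.run(
            ["gnatmake", "-gnat2022", "-gnatwa", str(src)],
            cwd=tmp, capture_output=True, text=True,
        )
        return result.returncode == 0

def filter_dataset(pairs, check=compiles_cleanly):
    """Keep only (instruction, ada_code) pairs whose code compiles."""
    return [(instr, code) for instr, code in pairs if check(code)]
```

The key property is that the filter runs before training, so no broken example ever reaches the optimizer.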

Custom Ada Compilation Benchmark (1,000 prompts, first-attempt clean compile):

Model                             | Size | Compile Rate
Steelman R5                       | 14B  | 68.6%
Claude Opus 4.6                   | n/a  | 42.1%
Claude Sonnet 4.6                 | n/a  | 37.2%
Qwen2.5-Coder-14B (base, untuned) | 14B  | ~35%
Claude Sonnet 4                   | n/a  | 27.5%

MultiPL-E HumanEval-Ada (157 problems, pass@1):

Model                    | Pass@1 | Compile Rate
Steelman R5              | 47.1%  | 74.5%
Qwen2.5-Coder-14B (base) | 34.4%  | 51.0%

These are the first published Ada pass@1 results on HumanEval for any open model.
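For reference, pass@k is conventionally computed with the unbiased estimator from the HumanEval paper; with one sample per problem it reduces to the fraction of problems solved. The sketch below shows the general formula; how many samples per problem were drawn here is not stated in the post.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), where
    n = samples drawn for a problem, c = samples passing all tests."""
    if n - c < k:
        return 1.0  # guaranteed at least one passing sample among k
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k=1):
    """Average pass@k over a benchmark; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = k = 1 per problem, benchmark_pass_at_k is simply the solved-problem rate, which is presumably what the 47.1% figure reports.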

Training details:

  • QLoRA 4-bit via Unsloth + TRL SFTTrainer
  • LoRA rank 32, alpha 64, targeting q/k/v/o/gate/up/down projections
  • Full retrain from base each round on accumulated dataset (adapter continuation caused catastrophic forgetting at R2)
  • 1 epoch, lr 2e-5, constant schedule, ~49 minutes per round on a rented H100
  • Five rounds (R1–R5); the project has taken about 2–3 days so far
  • Dataset includes standard generation, spec-to-body, error-fix, and multi-file tasks
  • Named after the 1978 DoD Steelman requirements that defined the Ada language
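The iterative regime above can be sketched as a loop; the stub names below are hypothetical and stand in for a real QLoRA run. The point it illustrates is methodological: each round trains from the untuned base on all data accumulated so far, rather than continuing the previous round's adapter, which the author reports caused catastrophic forgetting at R2.

```python
def run_rounds(base_model, round_datasets, train_fn):
    """base_model: identifier for the untuned checkpoint.
    round_datasets: list of per-round example lists.
    train_fn(model, data): returns a trained model (hypothetical stub
    for a full QLoRA fine-tuning run)."""
    accumulated = []
    checkpoints = []
    for round_data in round_datasets:
        accumulated = accumulated + round_data       # grow the dataset
        model = train_fn(base_model, list(accumulated))  # always from base
        checkpoints.append(model)
    return checkpoints
```

The alternative (passing the previous checkpoint instead of base_model into train_fn) is the adapter-continuation approach that was discarded.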

Try it right now:

ollama run hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF 

Fits in 12GB VRAM with Q4_K_M.
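A rough back-of-envelope check of that footprint, assuming Q4_K_M averages about 4.85 bits per weight (an approximation; the exact figure varies with the tensor mix) and roughly 14.8B parameters for the 14B model:

```python
def q4_k_m_weight_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Approximate GGUF weight footprint in GB. Excludes KV cache and
    activation overhead, which add more memory depending on context length."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

size_gb = q4_k_m_weight_gb(14.8)  # roughly 9 GB of weights
```

Around 9 GB of weights plus KV cache and runtime overhead is consistent with the stated 12GB VRAM fit.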

Limitations:

  • Compilation ≠ correctness. On HumanEval-Ada, 74.5% of solutions compile but only 47.1% produce correct output.
  • Error-fix capability is weak (5.1%). Don't expect it to debug your Ada code.
  • SPARK contracts compile but aren't verified with gnatprove.
  • Synthetically generated training data — no human Ada developers wrote these examples.
  • 14B model. It will miss things a bigger model would catch.
submitted by /u/clanker-lover