Ada is the language behind flight controllers, missile guidance, satellite systems, and air traffic control. It's one of the most important languages in safety-critical software — and every major LLM I tested is subpar at it.
I fine-tuned Qwen2.5-Coder-14B-Instruct using QLoRA on a compiler-verified dataset of 3,430 Ada/SPARK instruction pairs. Every single training example passes `gnatmake -gnat2022 -gnatwa`. The model never trains on broken code.
Custom Ada Compilation Benchmark (1,000 prompts, first-attempt clean compile):
| Model | Size | Compile Rate |
|---|---|---|
| Steelman R5 | 14B | 68.6% |
| Claude Opus 4.6 | — | 42.1% |
| Claude Sonnet 4.6 | — | 37.2% |
| Qwen2.5-Coder-14B (base, untuned) | 14B | ~35% |
| Claude Sonnet 4 | — | 27.5% |
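The benchmark harness isn't published, but one step any first-attempt-compile benchmark needs is pulling the Ada source out of a chat-formatted response before handing it to the compiler. A minimal sketch of that step (my own convention, not the actual benchmark code):

```python
import re

def extract_ada(response: str) -> str:
    """Pull the first fenced code block out of a model response;
    fall back to the raw text if the model answered with bare code."""
    m = re.search(r"```(?:ada)?\s*\n(.*?)```", response, re.DOTALL)
    return m.group(1).strip() if m else response.strip()
```

The extracted source would then be fed to `gnatmake -gnat2022 -gnatwa`; compile rate is just the fraction of the 1,000 prompts whose first response exits cleanly.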
MultiPL-E HumanEval-Ada (157 problems, pass@1):
| Model | Pass@1 | Compile Rate |
|---|---|---|
| Steelman R5 | 47.1% | 74.5% |
| Qwen2.5-Coder-14B (base) | 34.4% | 51.0% |
These are the first published Ada pass@1 results on HumanEval for any open model.
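The post doesn't say how many samples per problem were drawn. With one sample, pass@1 is just the fraction of problems solved; with several, it's typically computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021). A sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k(n, c, 1)` over the 157 problems gives the pass@1 column above.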
Training details:
- QLoRA 4-bit via Unsloth + TRL SFTTrainer
- LoRA rank 32, alpha 64, targeting q/k/v/o/gate/up/down projections
- Full retrain from base each round on accumulated dataset (adapter continuation caused catastrophic forgetting at R2)
- 1 epoch, lr 2e-5, constant schedule, ~49 minutes per round on a rented H100
- Five rounds (R1–R5); the project so far has taken about 2–3 days
- Dataset includes standard generation, spec-to-body, error-fix, and multi-file tasks
- Named after the 1978 DoD Steelman requirements that defined the Ada language
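Translated into code, the bullets above would correspond to hyperparameters roughly like these (a hypothetical reconstruction; the exact Unsloth/TRL argument names in the actual run may differ):

```python
# LoRA adapter settings, as they would be passed to peft.LoraConfig
LORA_CONFIG = {
    "r": 32,              # LoRA rank
    "lora_alpha": 64,     # scaling factor: alpha / r = 2.0
    "target_modules": [   # all attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Trainer settings, as they would appear in a TRL SFTTrainer config
TRAIN_CONFIG = {
    "num_train_epochs": 1,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "constant",
    "load_in_4bit": True,  # QLoRA: 4-bit quantized base weights
}
```

Note that "full retrain from base each round" means these configs are reapplied to the original base model with the accumulated dataset, rather than continuing training on the previous round's adapter.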
Try it right now:
```
ollama run hf.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF
```
Fits in 12GB VRAM with Q4_K_M.
Links:
- Model: https://huggingface.co/the-clanker-lover/steelman-14b-ada-v0.1
- GGUF: https://huggingface.co/the-clanker-lover/steelman-14b-ada-v0.1-GGUF
- Dataset: https://huggingface.co/datasets/the-clanker-lover/steelman-sft-ada
Limitations:
- Compilation ≠ correctness: 68.6% of outputs compile on the custom benchmark, but only 47.1% produce correct output on HumanEval-Ada.
- Error-fix capability is weak (5.1% on error-fix tasks). Don't expect it to debug your Ada code.
- SPARK contracts compile but aren't verified with gnatprove.
- Synthetically generated training data — no human Ada developers wrote these examples.
- 14B model. It will miss things a bigger model would catch.