I fine-tuned a 2B-parameter model that beat the 4B, 9B, 27B, and 35B versions of the same model family (Qwen 3.5) on a real product task, evaluated on 161 held-out samples, with every gap statistically significant (p < .0001).
The task: real-time dictation cleanup for VoiceInk, a macOS dictation app I use to talk to coding agents ~vibe~. Raw speech-to-text comes back with filler words, French grammar patterns, and phonetic misrecognitions — "cloud code" instead of "Claude Code", "chicken 17" instead of "chicane 17".
A few things I learned building this:
→ Completions-only training was the single biggest quality lever. Training loss dropped from ~0.85 to ~0.15 by masking loss on everything except the assistant response (see the first sketch after this list).
→ A reverse proxy between the app and the model server turned normal usage into dataset collection (second sketch below). 1451 real samples, zero annotation effort. Best decision in the project.
→ The model passed eval, then broke in production. Long QA debriefs for GT Coach, the sim-racing coaching app I'm building, triggered repetition amplification: 3266 words in, 7215 words out. Root cause: only 10 of the 1451 training samples exceeded 500 words. 160 synthetic long samples fixed it.
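
If you haven't done completions-only training before, here is a minimal sketch of the idea (illustrative only, not the repo's code): prompt tokens get a label of -100, PyTorch's ignore index, so cross-entropy loss only flows from the assistant response. The `build_completion_only_example` helper name and the Hugging Face-style tokenizer interface are assumptions for the example.

```python
# Minimal sketch of completions-only loss masking (not the repo's actual code).
# Assumes a Hugging Face-style tokenizer.
def build_completion_only_example(tokenizer, prompt: str, response: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(
        response + tokenizer.eos_token, add_special_tokens=False
    )["input_ids"]
    input_ids = prompt_ids + response_ids
    # -100 is the ignore index for PyTorch cross-entropy, so no gradient
    # comes from the prompt; only the assistant tokens contribute to loss.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```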
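
And a rough sketch of the reverse-proxy trick, assuming the app talks to a local OpenAI-compatible model server and responses are not streamed. The `UPSTREAM` URL and `dataset.jsonl` path are placeholders, not the project's actual setup: the proxy forwards each request untouched and appends the (raw transcript, cleaned transcript) pair to a JSONL file.

```python
# Hypothetical reverse proxy that turns normal app usage into a dataset.
# Assumes a non-streaming, OpenAI-compatible model server at UPSTREAM.
import json

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

UPSTREAM = "http://localhost:8080/v1/chat/completions"  # placeholder
app = FastAPI()

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    payload = await request.json()
    # Forward the request to the real model server unchanged.
    async with httpx.AsyncClient(timeout=60) as client:
        upstream = await client.post(UPSTREAM, json=payload)
    completion = upstream.json()
    # Log one JSONL line per real dictation: raw transcript in, cleaned text out.
    with open("dataset.jsonl", "a") as f:
        f.write(json.dumps({
            "raw": payload["messages"][-1]["content"],
            "clean": completion["choices"][0]["message"]["content"],
        }) + "\n")
    return JSONResponse(content=completion, status_code=upstream.status_code)
```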
Total compute cost: under £1 (the main cost came from my Claude Code subscription 😅). Labeling, synthetic data, and evaluation all ran through Claude.
Full write-up with methodology, code, and eval results: https://github.com/hourliert/VoiceInk-Qwen3.5-2B-FT/blob/master/docs/BLOG_POST.md