AI Navigate

Fine-tuned Qwen 3.5 2B to beat same-quant 4B, 9B, 27B, and 35B on a real dictation cleanup task; full pipeline, code, and eval included (RTX 4080 Super, under £1 compute)

Reddit r/LocalLLaMA / 3/14/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A 2B-parameter fine-tuned Qwen 3.5 model outperformed larger 4B, 9B, 27B, and 35B variants on a real dictation cleanup task, with 161 held-out samples and statistically significant results (p < .0001).
  • The target task is real-time dictation cleanup for VoiceInk (a macOS dictation app), addressing issues like filler words, French grammar patterns, and phonetic misrecognitions such as misheard code terms.
  • Completions-only training was identified as the main quality lever, dropping training loss from ~0.85 to ~0.15 by masking loss on everything except the assistant's response.
  • A reverse proxy between the app and model server enabled dataset collection from live usage (1451 real samples, zero annotation effort), cited as the best decision in the project.
  • The model passed evaluation but broke in production (repetition amplification); 160 synthetic samples fixed it. Total compute cost was under £1, with Claude handling labeling, synthetic data, and evaluation; the full write-up, code, and results are available on GitHub.

I fine-tuned a 2B parameter model that beat the 4B, 9B, 27B, and 35B versions of the same model family (Qwen 3.5) on a real product task, evaluated on 161 held-out samples, all gaps statistically significant (p < .0001).

The task: real-time dictation cleanup for VoiceInk, a macOS dictation app I use to talk to coding agents ~vibe~. Raw speech-to-text comes back with filler words, French grammar patterns, and phonetic misrecognitions — "cloud code" instead of "Claude Code", "chicken 17" instead of "chicane 17".
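For concreteness, here is what one such training pair might look like in chat format; the system prompt and the exact schema are my assumptions, not details from the post:

```python
import json

# Hypothetical example of one dictation-cleanup training pair.
# The system prompt and field names are illustrative, not from the repo.
sample = {
    "messages": [
        {"role": "system",
         "content": "Clean up raw dictation: remove filler words, fix grammar, "
                    "and correct misheard technical terms."},
        {"role": "user",
         "content": "um so ask cloud code to uh refactor the chicane 17 module"},
        {"role": "assistant",
         "content": "Ask Claude Code to refactor the chicane 17 module."},
    ]
}
print(json.dumps(sample, indent=2))
```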

A few things I learned building this:

→ Completions-only training was the single biggest quality lever. Training loss dropped from ~0.85 to ~0.15 by masking loss on everything except the assistant response.
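A minimal sketch of what completions-only masking looks like, assuming a Hugging-Face-style setup where label tokens set to -100 are excluded from the loss; the helper name here is mine, not from the repo:

```python
def mask_prompt_tokens(input_ids, prompt_len, ignore_index=-100):
    """Labels equal input_ids for the assistant response; ignore_index elsewhere.

    Cross-entropy implementations that honor ignore_index (e.g. PyTorch's)
    will then compute loss only on the assistant's tokens.
    """
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = ignore_index  # no loss on system/user prompt tokens
    return labels

# Example: a 6-token sequence where the first 4 tokens are the prompt.
labels = mask_prompt_tokens([101, 5, 6, 7, 42, 43], prompt_len=4)
print(labels)  # -> [-100, -100, -100, -100, 42, 43]
```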

→ A reverse proxy between the app and model server turned normal usage into dataset collection. 1451 real samples, zero annotation effort. Best decision in the project.
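One way such a proxy could look, sketched with only the Python standard library; the upstream URL, port, and OpenAI-style payload shape are assumptions, not details from the post:

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080/v1/chat/completions"  # assumed model server URL
LOG_PATH = "captured_samples.jsonl"

def log_sample(raw_text, cleaned_text, path=LOG_PATH):
    """Append one (raw dictation, model output) pair as a JSONL training sample."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"input": raw_text, "output": cleaned_text}) + "\n")

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        # Forward the request unchanged to the real model server.
        req = urllib.request.Request(
            UPSTREAM, data=body, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            resp_body = resp.read()
        # Capture the pair before replying to the app.
        request_json = json.loads(body)
        response_json = json.loads(resp_body)
        log_sample(request_json["messages"][-1]["content"],
                   response_json["choices"][0]["message"]["content"])
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp_body)

# To run: HTTPServer(("127.0.0.1", 9000), ProxyHandler).serve_forever()
```

Pointing the app at port 9000 instead of the model server would make every dictation a free labeled sample.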

→ The model passed eval then broke in production. Long QA debriefs for GT Coach, the sim-racing coaching app I am building, triggered repetition amplification: 3266 words in, 7215 words out. Root cause: 10 training samples over 500 words out of 1451. 160 synthetic samples fixed it.
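A cheap guard against this failure mode is auditing the dataset for length outliers and flagging runaway output/input ratios at inference time; this sketch uses the numbers from the post, but the helper names are mine:

```python
def word_count(text):
    return len(text.split())

def repetition_ratio(output_text, input_text):
    """Output/input word ratio; values well above 1.0 suggest runaway repetition."""
    return word_count(output_text) / max(1, word_count(input_text))

def long_samples(samples, threshold=500):
    """Flag training samples whose target exceeds `threshold` words."""
    return [s for s in samples if word_count(s["output"]) > threshold]

# The failure case from the post: 3266 words in, 7215 words out.
print(round(7215 / 3266, 2))  # -> 2.21
```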

Total compute cost: under £1 (the main cost came from my Claude Code subscription 😅). Labeling, synthetic data, and evaluation all ran through Claude.
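The post doesn't say which significance test was run over the 161 held-out samples; an exact sign (binomial) test over paired per-sample win/loss judgments is one common choice and needs only the standard library. The win/loss counts below are made up for illustration:

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact binomial (sign) test for paired comparisons, ignoring ties."""
    n = wins + losses
    k = max(wins, losses)
    # P(X >= k) under the fair-coin null, doubled for two sides (capped at 1).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Illustrative only: 120 wins vs 41 losses over 161 samples (made-up numbers).
print(sign_test_p(120, 41) < 1e-4)  # -> True
```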

Full write-up with methodology, code, and eval results: https://github.com/hourliert/VoiceInk-Qwen3.5-2B-FT/blob/master/docs/BLOG_POST.md

submitted by /u/ComplexNode