I fine-tuned a 2B-parameter model that beat the 4B, 9B, 27B, and 35B versions of the same model family (Qwen 3.5) on a real product task, evaluated on 161 held-out samples, with every gap statistically significant (p < .0001).
The task: real-time dictation cleanup for VoiceInk, a macOS dictation app I use to talk to coding agents ~vibe~. Raw speech-to-text comes back with filler words, French grammar patterns, and phonetic misrecognitions — "cloud code" instead of "Claude Code", "chicken 17" instead of "chicane 17".
A few things I learned building this:
→ Completions-only training was the single biggest quality lever. Training loss dropped from ~0.85 to ~0.15 by masking loss on everything except the assistant response (see the first sketch after this list).
→ A reverse proxy between the app and the model server turned normal usage into dataset collection (second sketch below). 1451 real samples, zero annotation effort. Best decision in the project.
→ The model passed eval, then broke in production. Long QA debriefs for GT Coach, the sim-racing coaching app I'm building, triggered repetition amplification: 3266 words in, 7215 words out. Root cause: only 10 of the 1451 training samples exceeded 500 words. 160 synthetic long samples fixed it.
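
If you haven't done completions-only training before, here is a minimal sketch of the idea (illustrative only, not the repo's code): prompt tokens get a label of -100, PyTorch's ignore index, so cross-entropy loss only flows from the assistant response. The `build_completion_only_example` helper name and the Hugging Face-style tokenizer interface are assumptions for the example.

```python
# Minimal sketch of completions-only loss masking (not the repo's actual code).
# Assumes a Hugging Face-style tokenizer.
def build_completion_only_example(tokenizer, prompt: str, response: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(
        response + tokenizer.eos_token, add_special_tokens=False
    )["input_ids"]
    input_ids = prompt_ids + response_ids
    # -100 is the ignore index for PyTorch cross-entropy, so no gradient
    # comes from the prompt; only the assistant tokens contribute to loss.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```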
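
And a rough sketch of the reverse-proxy trick, assuming the app talks to a local OpenAI-compatible model server and responses are not streamed. The `UPSTREAM` URL and `dataset.jsonl` path are placeholders, not the project's actual setup: the proxy forwards each request untouched and appends the (raw transcript, cleaned transcript) pair to a JSONL file.

```python
# Hypothetical reverse proxy that turns normal app usage into a dataset.
# Assumes a non-streaming, OpenAI-compatible model server at UPSTREAM.
import json

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

UPSTREAM = "http://localhost:8080/v1/chat/completions"  # placeholder
app = FastAPI()

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    payload = await request.json()
    # Forward the request to the real model server unchanged.
    async with httpx.AsyncClient(timeout=60) as client:
        upstream = await client.post(UPSTREAM, json=payload)
    completion = upstream.json()
    # Log one JSONL line per real dictation: raw transcript in, cleaned text out.
    with open("dataset.jsonl", "a") as f:
        f.write(json.dumps({
            "raw": payload["messages"][-1]["content"],
            "clean": completion["choices"][0]["message"]["content"],
        }) + "\n")
    return JSONResponse(content=completion, status_code=upstream.status_code)
```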
Total compute cost: under £1 (the main cost came from my Claude Code subscription 😅). Labeling, synthetic data, and evaluation all ran through Claude.
Full write-up with methodology, code, and eval results: https://github.com/hourliert/VoiceInk-Qwen3.5-2B-FT/blob/master/docs/BLOG_POST.md