There are a lot of SLM options right now, and picking the right base model for fine-tuning is a real decision. Qwen3, Llama 3.2, Gemma 3, SmolLM2, Liquid AI's LFM2 - each family has multiple size variants, and it's hard to know which one will actually respond best to your training data. We ran a systematic benchmark to answer this with data instead of vibes.

Setup: 15 models, 9 diverse tasks (classification, information extraction, document understanding, open-book QA, closed-book QA, tool calling), all fine-tuned with identical hyperparameters (4 epochs, lr 5e-5, LoRA rank 64). Training data: 10k synthetic examples per task, generated from a 120B+ teacher. Results aggregated using rank-based averaging across all benchmarks, with 95% confidence intervals.

Models tested: Qwen3-8B, Qwen3-4B-Instruct-2507, Qwen3-1.7B, Qwen3-0.6B, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Llama-3.2-1B-Instruct, LFM2-350M, LFM2-1.2B, LFM2-2.6B-Exp, LFM2.5-1.2B-Instruct, SmolLM2-1.7B-Instruct, SmolLM2-135M-Instruct, gemma-3-1b-it, gemma-3-270m-it.

Best fine-tuned performance

Qwen3-8B takes the top spot with an average rank of 2.33 and the tightest confidence interval (±0.57) of any model. It's not just good, it's consistently good across every task type. Here's the top 6:
Notable: Llama-3.2-3B ties with Llama-3.1-8B at rank 4.11, but with a tighter CI. So if you're memory-constrained, the 3B Llama is a solid pick over the 8B.

Most tunable (biggest gains from fine-tuning)

This is where it gets interesting. Liquid AI's LFM2 family sweeps the top three spots:
LFM2-350M has just 350M parameters but absorbs training signal more effectively than models 4-20x its size. The CI of ±0.89 means this isn't a fluke on one or two tasks; it improves consistently everywhere. If you're deploying on edge hardware or embedded devices, this is a big deal.

The larger models (Qwen3-8B, Qwen3-4B) rank near the bottom for tunability, which makes sense: they already perform well at baseline, so there's less room for improvement.

Can a fine-tuned 4B model match a 120B+ teacher?

Yes. Here's Qwen3-4B-Instruct-2507 vs the GPT-OSS-120B teacher:
The 4B student beats the 120B teacher on 8 of 9 benchmarks. The SQuAD 2.0 result (+19 points) is particularly striking: fine-tuning embeds domain knowledge more effectively than prompting a model 30x larger.

Practical recommendations
The bottom line: fine-tuning matters more than base model choice. A well-tuned 1B model can outperform a prompted 8B model.

Full post with charts, methodology details, and the raw results: https://www.distillabs.ai/blog/what-small-language-model-is-best-for-fine-tuning
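For reference, the shared recipe in the setup (4 epochs, lr 5e-5, LoRA rank 64) maps onto a standard PEFT + transformers configuration along these lines. This is a hedged sketch, not the authors' actual code: only the three stated hyperparameters come from the post, while the target modules, alpha, dropout, and batch size are assumptions.

```python
# Sketch of the benchmark's shared fine-tuning recipe using Hugging Face
# PEFT + transformers. Values marked "from the post" are stated in the
# source; everything else is an illustrative assumption.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                       # LoRA rank, from the post
    lora_alpha=128,             # assumption: alpha = 2 * rank is a common default
    lora_dropout=0.05,          # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=4,             # from the post
    learning_rate=5e-5,             # from the post
    per_device_train_batch_size=8,  # assumption
)
```

The same `lora_config` and `training_args` would be reused unchanged across all 15 models, which is the point of the benchmark: isolate the base model as the only variable.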
We benchmarked 15 small language models across 9 tasks to find which one you should actually fine-tune. Here are the results.
Reddit r/LocalLLaMA / 3/17/2026
Key Points
- The benchmark evaluated 15 small language models across 9 tasks—from classification to tool calling—using identical fine-tuning settings (4 epochs, lr 5e-5, LoRA rank 64) and 10k synthetic examples per task.
- Qwen3-8B achieved the top average rank of 2.33 with the tightest 95% CI, showing consistent performance across all task types.
- Llama-3.2-3B matched Llama-3.1-8B in rank but with a tighter confidence interval, making the 3B Llama variant a strong memory-efficient option.
- In the most tunable category, Liquid AI's LFM2 family dominated, with LFM2-350M, LFM2-1.2B, and LFM2.5-1.2B-Instruct leading the pack in fine-tuning gains.
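The rank-based averaging used for aggregation can be sketched in a few lines: rank the models within each task (1 = best), then average each model's ranks across tasks and attach a 95% confidence interval. A minimal pure-Python sketch, assuming a normal-approximation CI (the post does not say which CI method was used) and ignoring ties:

```python
import math
import statistics

def rank_models(task_scores):
    """task_scores: {task: {model: score}}, higher score = better.

    Returns {model: (mean_rank, ci95)} where ci95 is a
    normal-approximation half-width (an assumption; ties are
    broken arbitrarily rather than midranked).
    """
    ranks = {}  # model -> list of per-task ranks (1 = best)
    for scores in task_scores.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for pos, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(pos)

    summary = {}
    for model, r in ranks.items():
        mean = statistics.mean(r)
        ci = 1.96 * statistics.stdev(r) / math.sqrt(len(r)) if len(r) > 1 else 0.0
        summary[model] = (mean, ci)
    return summary
```

On this view, Qwen3-8B's result (mean rank 2.33, ±0.57 over 9 tasks) is a small average rank *and* a small spread, i.e. consistently near the top everywhere rather than winning big on a few tasks.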