We benchmarked 15 small language models across 9 tasks to find which one you should actually fine-tune. Here are the results.

Reddit r/LocalLLaMA / 3/17/2026

Key Points

The benchmark evaluated 15 small language models across 9 tasks, from classification to tool calling, using identical fine-tuning settings (4 epochs, learning rate 5e-5, LoRA rank 64) and 10k synthetic examples per task.
Qwen3-8B achieved the top average rank of 2.33 with the tightest 95% CI, showing consistent performance across all task types.
Llama-3.2-3B matched Llama-3.1-8B in rank but with a tighter confidence interval, making the 3B Llama variant a strong memory-efficient option.
In the most tunable category, Liquid AI's LFM2 family dominated, with LFM2-350M, LFM2-1.2B, and LFM2.5-1.2B-Instruct leading the pack in fine-tuning gains.

There are a lot of SLM options right now and picking the right base model for fine-tuning is a real decision. Qwen3, Llama 3.2, Gemma 3, SmolLM2, Liquid AI's LFM2 - each family has multiple size variants and it's hard to know which one will actually respond best to your training data. We ran a systematic benchmark to answer this with data instead of vibes.

Setup: 15 models and 9 tasks spanning six task types (classification, information extraction, document understanding, open-book QA, closed-book QA, tool calling), all fine-tuned with identical hyperparameters (4 epochs, learning rate 5e-5, LoRA rank 64). Training data: 10k synthetic examples per task, generated by a 120B+ teacher model. Results are aggregated by averaging each model's rank across all benchmarks and reported with 95% confidence intervals.
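To make the scale of that setup concrete, here is a minimal sketch of the run grid. The four settings come from the post; the class name, field names, and placeholder model/task labels are made up for illustration, and anything the post doesn't state (batch size, LoRA alpha, target modules) is deliberately left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SharedFinetuneSettings:
    """Settings stated in the post; the field names are illustrative."""
    epochs: int = 4
    learning_rate: float = 5e-5
    lora_rank: int = 64
    train_examples_per_task: int = 10_000


def build_run_grid(models, tasks, settings):
    """Pair every model with every task under the same shared settings."""
    return [(model, task, settings) for model, task in product(models, tasks)]


settings = SharedFinetuneSettings()
models = [f"model_{i}" for i in range(15)]  # placeholders for the 15 models
tasks = [f"task_{i}" for i in range(9)]     # placeholders for the 9 tasks

runs = build_run_grid(models, tasks, settings)
print(len(runs))  # 135 fine-tuning runs in total

# Each run passes over the 10k examples once per epoch.
examples_seen = settings.epochs * settings.train_examples_per_task
print(examples_seen)  # 40000
```

Holding one frozen settings object across the whole grid is the point: any difference in results then comes from the base model, not from per-model tuning of hyperparameters.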

Models tested: Qwen3-8B, Qwen3-4B-Instruct-2507, Qwen3-1.7B, Qwen3-0.6B, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Llama-3.2-1B-Instruct, LFM2-350M, LFM2-1.2B, LFM2-2.6B-Exp, LFM2.5-1.2B-Instruct, SmolLM2-1.7B-Instruct, SmolLM2-135M-Instruct, gemma-3-1b-it, gemma-3-270m-it.

Best fine-tuned performance

Qwen3-8B takes the top spot with an average rank of 2.33 and the tightest confidence interval (±0.57) of any model. It's not just good, it's consistently good across every task type. Here's the top 6:

Model	Avg Rank	95% CI
Qwen3-8B	2.33	±0.57
Qwen3-4B-Instruct-2507	3.33	±1.90
Llama-3.1-8B-Instruct	4.11	±2.08
Llama-3.2-3B-Instruct	4.11	±1.28
Qwen3-1.7B	4.67	±1.79
Qwen3-0.6B	5.44	±2.60

Notable: Llama-3.2-3B-Instruct ties with Llama-3.1-8B-Instruct at an average rank of 4.11, but with a tighter CI (±1.28 vs ±2.08). So if you're memory-constrained, the 3B Llama is a solid pick over the 8B.
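For anyone who wants to reproduce this kind of table on their own results, here is a rough sketch of rank-based averaging with a confidence interval. The post doesn't say how the intervals were computed, so the Student-t interval over per-task ranks is an assumption, as are the function names and the rule that tied scores share the average rank. The rank lists at the bottom are made up to show how two models can tie on average rank while one has a much tighter CI.

```python
import math
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 8 degrees of freedom.
# Only valid when there are exactly 9 tasks; look up another value otherwise.
T_CRIT_DF8 = 2.306


def ranks_for_task(scores):
    """Rank models on one task: 1 = best score, ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of the tied positions
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def average_rank_with_ci(per_task_ranks, t_crit=T_CRIT_DF8):
    """Return (mean rank, CI half-width) for one model's per-task ranks."""
    half_width = t_crit * stdev(per_task_ranks) / math.sqrt(len(per_task_ranks))
    return mean(per_task_ranks), half_width


# Made-up per-task ranks for two hypothetical models over 9 tasks.
steady = [4, 4, 4, 4, 4, 4, 4, 4, 5]
swingy = [1, 8, 2, 7, 3, 6, 4, 5, 1]

steady_mean, steady_half = average_rank_with_ci(steady)
swingy_mean, swingy_half = average_rank_with_ci(swingy)
print(round(steady_mean, 2), round(swingy_mean, 2))  # 4.11 4.11
print(steady_half < swingy_half)  # True
```

Same average rank, very different spread: the steady model lands in roughly the same place on every task, while the swingy one is great on some and poor on others.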

Most tunable (biggest gains from fine-tuning)

This is where it gets interesting. Liquid AI's LFM2 family sweeps the top three spots:

Model	Avg Rank	95% CI
LFM2-350M	2.11	±0.89
LFM2-1.2B	3.44	±2.24
LFM2.5-1.2B-Instruct	4.89	±1.62

LFM2-350M has just 350M parameters but absorbs training signal more effectively than models 4-20x its size. The CI of ±0.89 means this isn't a fluke on one or two tasks, it improves consistently everywhere. If you're deploying on edge hardware or embedded devices, this is a big deal.

The larger models (Qwen3-8B, Qwen3-4B) rank near the bottom for tunability, which makes sense: they already perform well at baseline, so there's less room for improvement.
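A tiny sketch of that headroom effect, assuming tunability is measured as fine-tuned score minus base score per task (the post doesn't spell out the exact formula). All scores, task names, and model labels below are made up for illustration.

```python
from statistics import mean


def finetuning_gains(base_scores, tuned_scores):
    """Per-task gain from fine-tuning: tuned score minus base score."""
    return {
        task: round(tuned_scores[task] - base_scores[task], 2)
        for task in base_scores
    }


# Hypothetical scores: a small model with a weak baseline and a large
# model that is already strong before any fine-tuning.
small_base = {"task_a": 0.40, "task_b": 0.35}
small_tuned = {"task_a": 0.85, "task_b": 0.80}
large_base = {"task_a": 0.80, "task_b": 0.78}
large_tuned = {"task_a": 0.90, "task_b": 0.88}

small_gain = mean(finetuning_gains(small_base, small_tuned).values())
large_gain = mean(finetuning_gains(large_base, large_tuned).values())

# The large model still ends up with the higher final score on both tasks,
# but the small model shows the bigger improvement.
print(small_gain > large_gain)  # True
```

So a low tunability rank is not a knock on final quality. The two leaderboards measure different things, which is why a model can top one and sit near the bottom of the other.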

Can a fine-tuned 4B model match a 120B+ teacher?

Yes. Here's Qwen3-4B-Instruct-2507 vs the GPT-OSS-120B teacher:

Benchmark	Teacher	Qwen3-4B Finetuned	Δ
TREC	0.90	0.93	+0.03
Banking77	0.92	0.89	-0.03
Docs	0.82	0.84	+0.02
Ecommerce	0.88	0.90	+0.03
PII Redaction	0.81	0.83	+0.02
Roman Empire QA	0.75	0.80	+0.05
Smart Home	0.92	0.96	+0.04
SQuAD 2.0	0.52	0.71	+0.19
Voice Assistant	0.92	0.95	+0.03

The 4B student beats the 120B teacher on 8 of 9 benchmarks. The SQuAD 2.0 result (+19 points) is particularly striking: fine-tuning embeds domain knowledge more effectively than prompting a model 30x larger.
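The win count is easy to double-check from the table. The sketch below recomputes the deltas from the rounded scores listed above; variable names are illustrative. Recomputed values can differ from the listed Δ by 0.01 (Ecommerce comes out as 0.02 here), presumably because the post's Δ column was computed from unrounded scores, though that is an assumption.

```python
# Scores copied from the table above (teacher vs fine-tuned Qwen3-4B).
teacher = {
    "TREC": 0.90, "Banking77": 0.92, "Docs": 0.82, "Ecommerce": 0.88,
    "PII Redaction": 0.81, "Roman Empire QA": 0.75, "Smart Home": 0.92,
    "SQuAD 2.0": 0.52, "Voice Assistant": 0.92,
}
student = {
    "TREC": 0.93, "Banking77": 0.89, "Docs": 0.84, "Ecommerce": 0.90,
    "PII Redaction": 0.83, "Roman Empire QA": 0.80, "Smart Home": 0.96,
    "SQuAD 2.0": 0.71, "Voice Assistant": 0.95,
}

# Round to 2 decimals to hide floating-point noise in the subtraction.
deltas = {name: round(student[name] - teacher[name], 2) for name in teacher}

wins = [name for name, delta in deltas.items() if delta > 0]
largest = max(deltas, key=deltas.get)

print(len(wins), "of", len(deltas))  # 8 of 9
print(largest, deltas[largest])      # SQuAD 2.0 0.19
```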

Practical recommendations

Max accuracy: Qwen3-8B
Strong accuracy, smaller footprint: Qwen3-4B-Instruct-2507
Under 2B params: Qwen3-0.6B or Llama-3.2-1B-Instruct
Max fine-tuning ROI: LFM2-350M or LFM2-1.2B
Ultra-compact / IoT: LFM2-350M
No fine-tuning possible: Qwen3-8B (best zero-shot)

The bottom line: fine-tuning matters more than base model choice. A well-tuned 1B model can outperform a prompted 8B model.

Full post with charts, methodology details, and the raw results: https://www.distillabs.ai/blog/what-small-language-model-is-best-for-fine-tuning

submitted by /u/party-horse