We benchmarked 15 small language models across 9 tasks to find which one you should actually fine-tune. Here are the results.

Reddit r/LocalLLaMA / 3/17/2026

Key Points

  • The benchmark evaluated 15 small language models across 9 tasks, from classification to tool calling, using identical fine-tuning settings (4 epochs, lr 5e-5, LoRA rank 64) and 10k synthetic examples per task.
  • Qwen3-8B achieved the top average rank of 2.33 with the tightest 95% CI, showing consistent performance across all task types.
  • Llama-3.2-3B matched Llama-3.1-8B in rank but with a tighter confidence interval, making the 3B Llama variant a strong memory-efficient option.
  • In the most-tunable ranking (biggest gains from fine-tuning), Liquid AI's LFM2 family took the top three spots: LFM2-350M, LFM2-1.2B, and LFM2.5-1.2B-Instruct.

There are a lot of SLM options right now, and picking the right base model for fine-tuning is a real decision. Qwen3, Llama 3.2, Gemma 3, SmolLM2, Liquid AI's LFM2: each family has multiple size variants, and it's hard to know which one will actually respond best to your training data. We ran a systematic benchmark to answer this with data instead of vibes.

Setup: 15 models and 9 tasks spanning six task types (classification, information extraction, document understanding, open-book QA, closed-book QA, tool calling), all fine-tuned with identical hyperparameters (4 epochs, lr 5e-5, LoRA rank 64). Training data: 10k synthetic examples per task, generated by a 120B+ teacher model. Results are aggregated by averaging each model's rank across all benchmarks and are reported with 95% confidence intervals.
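
For anyone who wants to reproduce the aggregation step, here is a minimal sketch of rank-based averaging with a 95% confidence interval. The post doesn't say exactly how ties or the interval were handled, so treat these as assumptions: ties share the average rank, and the interval is a normal approximation over per-task ranks. The model names, task names, and scores are made-up toy values, not benchmark results.

```python
from statistics import NormalDist, mean, stdev


def rank_models(scores_by_task):
    """Rank models within each task (1 = best score); ties share the average rank."""
    ranks = {}
    for scores in scores_by_task.values():
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over every model tied with the one at position i.
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared = ((i + 1) + (j + 1)) / 2  # average of positions i+1 .. j+1
            for model, _ in ordered[i:j + 1]:
                ranks.setdefault(model, []).append(shared)
            i = j + 1
    return ranks


def summarize(ranks, confidence=0.95):
    """Return {model: (average rank, CI half-width)} using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        model: (mean(rs), z * stdev(rs) / len(rs) ** 0.5)
        for model, rs in ranks.items()
    }


# Hypothetical toy scores (higher is better), for illustration only.
toy_scores = {
    "task_1": {"model_a": 0.90, "model_b": 0.85, "model_c": 0.80},
    "task_2": {"model_a": 0.70, "model_b": 0.75, "model_c": 0.60},
    "task_3": {"model_a": 0.88, "model_b": 0.88, "model_c": 0.91},
}

ranks = rank_models(toy_scores)
summary = summarize(ranks)
for model, (avg, half) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print(f"{model}: avg rank {avg:.2f} ±{half:.2f}")
```

A model that ranks similarly on every task gets a small half-width, which is what the "tight CI" language in the results refers to.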

Models tested: Qwen3-8B, Qwen3-4B-Instruct-2507, Qwen3-1.7B, Qwen3-0.6B, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Llama-3.2-1B-Instruct, LFM2-350M, LFM2-1.2B, LFM2-2.6B-Exp, LFM2.5-1.2B-Instruct, SmolLM2-1.7B-Instruct, SmolLM2-135M-Instruct, gemma-3-1b-it, gemma-3-270m-it.
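
To make the size of the experiment concrete, here is a rough sketch of the run grid, assuming one fine-tuning run per (model, task) pair with the shared hyperparameters from the setup. `FineTuneConfig` is a hypothetical container, not part of any training library, and the task names are taken from the teacher comparison table later in the post on the assumption that those are the same nine tasks.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical per-run config; the defaults are the shared settings from the post."""
    model: str
    task: str
    epochs: int = 4
    learning_rate: float = 5e-5
    lora_rank: int = 64
    train_examples: int = 10_000


MODELS = [
    "Qwen3-8B", "Qwen3-4B-Instruct-2507", "Qwen3-1.7B", "Qwen3-0.6B",
    "Llama-3.1-8B-Instruct", "Llama-3.2-3B-Instruct", "Llama-3.2-1B-Instruct",
    "LFM2-350M", "LFM2-1.2B", "LFM2-2.6B-Exp", "LFM2.5-1.2B-Instruct",
    "SmolLM2-1.7B-Instruct", "SmolLM2-135M-Instruct",
    "gemma-3-1b-it", "gemma-3-270m-it",
]

# Assumed to be the nine tasks; names come from the teacher comparison table.
TASKS = [
    "TREC", "Banking77", "Docs", "Ecommerce", "PII Redaction",
    "Roman Empire QA", "Smart Home", "SQuAD 2.0", "Voice Assistant",
]

# Every model is trained on every task with identical settings.
runs = [FineTuneConfig(model, task) for model, task in product(MODELS, TASKS)]
print(f"{len(MODELS)} models x {len(TASKS)} tasks = {len(runs)} fine-tuning runs")
```

Keeping the hyperparameters as defaults rather than per-model overrides mirrors the "identical settings" design: differences in results then come from the base model, not from tuning effort.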

Best fine-tuned performance

Qwen3-8B takes the top spot with an average rank of 2.33 and the tightest confidence interval (±0.57) of any model. It's not just good, it's consistently good across every task type. Here's the top 6:

| Model | Avg Rank | 95% CI |
|---|---|---|
| Qwen3-8B | 2.33 | ±0.57 |
| Qwen3-4B-Instruct-2507 | 3.33 | ±1.90 |
| Llama-3.1-8B-Instruct | 4.11 | ±2.08 |
| Llama-3.2-3B-Instruct | 4.11 | ±1.28 |
| Qwen3-1.7B | 4.67 | ±1.79 |
| Qwen3-0.6B | 5.44 | ±2.60 |

Notable: Llama-3.2-3B ties with Llama-3.1-8B at an average rank of 4.11, but with a tighter CI (±1.28 vs ±2.08). So if you're memory-constrained, the 3B Llama is a solid pick over the 8B.
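
If you want to turn the top-6 table into a quick pick for a given memory budget, something like the sketch below works. It is only an illustration: the parameter counts are the nominal sizes read off the model names, and breaking rank ties by the tighter CI is our assumption, not a rule from the benchmark.

```python
# (model, nominal params in billions, avg rank, 95% CI half-width) from the top-6 table.
TOP_SIX = [
    ("Qwen3-8B", 8.0, 2.33, 0.57),
    ("Qwen3-4B-Instruct-2507", 4.0, 3.33, 1.90),
    ("Llama-3.1-8B-Instruct", 8.0, 4.11, 2.08),
    ("Llama-3.2-3B-Instruct", 3.0, 4.11, 1.28),
    ("Qwen3-1.7B", 1.7, 4.67, 1.79),
    ("Qwen3-0.6B", 0.6, 5.44, 2.60),
]


def best_under(max_params_b):
    """Return the best-ranked model at or below the size budget, or None if none fits."""
    candidates = [row for row in TOP_SIX if row[1] <= max_params_b]
    if not candidates:
        return None
    # Lower average rank wins; a tie goes to the tighter confidence interval.
    return min(candidates, key=lambda row: (row[2], row[3]))[0]


for budget in (8, 3, 2):
    print(f"<= {budget}B params: {best_under(budget)}")
```

With a 3B budget this picks Llama-3.2-3B-Instruct, since Qwen3-4B-Instruct-2507 no longer fits.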

Most tunable (biggest gains from fine-tuning)

This is where it gets interesting. Liquid AI's LFM2 family sweeps the top three spots:

| Model | Avg Rank | 95% CI |
|---|---|---|
| LFM2-350M | 2.11 | ±0.89 |
| LFM2-1.2B | 3.44 | ±2.24 |
| LFM2.5-1.2B-Instruct | 4.89 | ±1.62 |

LFM2-350M has just 350M parameters but absorbs training signal more effectively than models 4-20x its size. The narrow CI of ±0.89 suggests this isn't a fluke on one or two tasks; it improves consistently across them. If you're deploying on edge hardware or embedded devices, that makes it a strong candidate.

The larger models (Qwen3-8B, Qwen3-4B) rank near the bottom for tunability, which makes sense: they already perform well at baseline, so there's less room for improvement.

Can a fine-tuned 4B model match a 120B+ teacher?

Yes. Here's Qwen3-4B-Instruct-2507 vs the GPT-OSS-120B teacher:

| Benchmark | Teacher | Qwen3-4B Finetuned | Δ |
|---|---|---|---|
| TREC | 0.90 | 0.93 | +0.03 |
| Banking77 | 0.92 | 0.89 | -0.03 |
| Docs | 0.82 | 0.84 | +0.02 |
| Ecommerce | 0.88 | 0.90 | +0.03 |
| PII Redaction | 0.81 | 0.83 | +0.02 |
| Roman Empire QA | 0.75 | 0.80 | +0.05 |
| Smart Home | 0.92 | 0.96 | +0.04 |
| SQuAD 2.0 | 0.52 | 0.71 | +0.19 |
| Voice Assistant | 0.92 | 0.95 | +0.03 |

The 4B student beats the 120B teacher on 8 of 9 benchmarks; Banking77 is the only loss. The SQuAD 2.0 result (+0.19, or 19 points) is particularly striking: on this task, fine-tuning embeds domain knowledge more effectively than prompting a model 30x larger.
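
The win count is easy to check from the table. Here is a small sketch that recomputes the deltas from the rounded scores shown above and counts how often the student comes out ahead. Because it works from two-decimal scores, a recomputed delta can differ from the reported one by 0.01 (Ecommerce comes out at +0.02 here versus +0.03 in the table, possibly because the reported deltas were taken before rounding).

```python
# (teacher score, fine-tuned Qwen3-4B score) per benchmark, copied from the table.
RESULTS = {
    "TREC": (0.90, 0.93),
    "Banking77": (0.92, 0.89),
    "Docs": (0.82, 0.84),
    "Ecommerce": (0.88, 0.90),
    "PII Redaction": (0.81, 0.83),
    "Roman Empire QA": (0.75, 0.80),
    "Smart Home": (0.92, 0.96),
    "SQuAD 2.0": (0.52, 0.71),
    "Voice Assistant": (0.92, 0.95),
}

# Delta = student minus teacher, rounded to the table's two decimals.
deltas = {
    name: round(student - teacher, 2)
    for name, (teacher, student) in RESULTS.items()
}
wins = [name for name, delta in deltas.items() if delta > 0]
biggest = max(deltas, key=deltas.get)

print(f"Student wins {len(wins)} of {len(RESULTS)} benchmarks")
print(f"Largest gain: {biggest} ({deltas[biggest]:+.2f})")
print(f"Teacher/student size ratio: {120 / 4:.0f}x")
```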

Practical recommendations

  • Max accuracy: Qwen3-8B
  • Strong accuracy, smaller footprint: Qwen3-4B-Instruct-2507
  • Under 2B params: Qwen3-0.6B or Llama-3.2-1B-Instruct
  • Max fine-tuning ROI: LFM2-350M or LFM2-1.2B
  • Ultra-compact / IoT: LFM2-350M
  • No fine-tuning possible: Qwen3-8B (best zero-shot)

The bottom line: fine-tuning matters more than base model choice. A well-tuned 1B model can outperform a prompted 8B model.

Full post with charts, methodology details, and the raw results: https://www.distillabs.ai/blog/what-small-language-model-is-best-for-fine-tuning

submitted by /u/party-horse