How difficult is distilling?

Reddit r/LocalLLaMA / 5/9/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis

Key Points

  • The post asks why more distilled models are not widely seen, given that DeepSeek R1 was quickly distilled into smaller models such as Llama 3 8B and Qwen 2.5 7B.
  • It probes how difficult distillation is in practice, i.e. the effort required to produce a smaller student model from a larger teacher (see the sketch after this list).
  • It asks about the cost of distillation, focusing on compute expenses and overall feasibility for practitioners.
  • It also seeks quantitative guidance on resource requirements, such as how many tokens or prompts are needed to achieve useful distillation results.
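
For context on what "distillation" means mechanically: the classic recipe (Hinton et al., 2015) trains the student to match the teacher's temperature-softened output distribution, while the DeepSeek R1 distills used the simpler sequence-level variant, plain supervised fine-tuning on roughly 800k teacher-generated samples per the R1 report. Below is a minimal sketch of the classic logit-matching loss, assuming PyTorch; the tensors are random stand-ins for real model logits, and names like `distillation_loss` are illustrative, not taken from any of the models mentioned in the post.

```python
# Minimal sketch of the classic logit-distillation loss (Hinton et al., 2015).
# Assumes PyTorch; teacher/student logits here are random stand-ins.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student distributions."""
    # Soften both distributions with the temperature before comparing them.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Stand-in logits: a batch of 4 positions over a 32k-token vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```

Sequence-level distillation drops the KL term entirely: you generate completions with the teacher and fine-tune the student on them with ordinary cross-entropy, which is why the dominant cost tends to be teacher inference rather than any exotic training code.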

I remember a year or so ago when DeepSeek R1 came out and it was pretty quickly distilled into Llama 3 8b and Qwen 2.5 (?) 7b. Why don’t we see more distilled models? How expensive is it? How many tokens or prompts does it take?

submitted by /u/GreedyWorking1499