The joy and pain of training an LLM from scratch

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • mii-llm has published a technical report documenting the development process of "Zagreus" and "Nesso", families of small LLMs designed for edge deployment and multilingual (primarily European-language) use.
  • The 0.4B (~400 million) parameter models were trained from scratch, using bilingual English + target-language pretraining to cover Italian, Spanish, French, and Portuguese.
  • The released models include a base model for each language, instruct models for conversational use, agentic models for structured/agentic tasks, and "Open-Zagreus", built entirely from open data with an open recipe.
  • The documented training setup includes 64 NVIDIA A100 GPUs, ~1 trillion tokens, Hugging Face Nanotron (pretraining), Axolotl (post-training), and Slurm for multi-node orchestration.
  • The report also explains why a dense 0.4B architecture was chosen over MoE (sparsity) at the sub-1B scale, prioritizing stability and resource utilization.

mii-llm just released a detailed technical report on the development of the Zagreus and Nesso model families: a set of 0.4B parameter language models trained from scratch with a focus on edge deployment, multilingual capability, and European languages.

The report documents the full pipeline behind a family of small language models designed for Italian, Spanish, French, and Portuguese, with bilingual pretraining centered on English + target language settings.

Released models

According to the report, the release covers:

  • a base model for each supported language (Italian, Spanish, French, Portuguese)
  • instruct variants tuned for conversational use
  • agentic variants aimed at structured / agentic tasks
  • Open-Zagreus, built entirely from open data with an open training recipe

Training setup

According to the report, the project used:

  • 64 NVIDIA A100 GPUs
  • ~1 trillion tokens
  • Datatrove for tokenization
  • Hugging Face Nanotron for pretraining
  • Axolotl for post-training
  • Slurm for multi-node orchestration
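
The scale of this setup can be sanity-checked with the common ~6·N·D FLOPs rule of thumb for dense transformer training. A rough sketch — the A100 peak throughput and the MFU figure below are my assumptions, not numbers from the report:

```python
# Back-of-the-envelope training-time estimate for the reported setup:
# 0.4B dense parameters, ~1T tokens, 64 A100s.
# Uses the common ~6*N*D FLOPs rule of thumb for dense transformer training.

N = 0.4e9       # model parameters (from the report)
D = 1.0e12      # training tokens (from the report)
n_gpus = 64     # A100 GPUs (from the report)

peak_flops = 312e12  # assumed: A100 bf16 tensor-core peak, FLOPs/s
mfu = 0.40           # assumed: model FLOPs utilization

total_flops = 6 * N * D                    # ~2.4e21 FLOPs for the full run
cluster_flops = n_gpus * peak_flops * mfu  # effective cluster FLOPs/s
seconds = total_flops / cluster_flops
days = seconds / 86400

print(f"total compute: {total_flops:.2e} FLOPs")
print(f"estimated wall-clock: {days:.1f} days")
```

Under these assumptions the run works out to a few days of wall-clock time, which is at least plausible for a 0.4B model on 64 GPUs; the report itself does not state the actual duration here.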

The report also explains why a dense 0.4B architecture was selected instead of MoE, arguing that in the sub-1B regime, stability and utilization can matter more than sparse efficiency.
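For intuition on what "dense 0.4B" means architecturally, the standard parameter-count estimate for a decoder-only transformer (embeddings plus roughly 12·d² weights per layer) can be sketched. The vocabulary size, hidden size, and layer count below are illustrative guesses, not the actual Zagreus/Nesso configuration:

```python
# Rough dense decoder-only transformer parameter count:
#   embeddings: vocab * d_model           (tied input/output embeddings)
#   per layer:  ~4*d^2 (attention) + 8*d^2 (MLP, 4x expansion) = 12*d^2
# All hyperparameters below are illustrative, NOT from the report.

def dense_param_count(vocab: int, d_model: int, n_layers: int) -> int:
    embed = vocab * d_model
    per_layer = 12 * d_model ** 2
    return embed + n_layers * per_layer

# A hypothetical config landing near 0.4B total parameters:
total = dense_param_count(vocab=64_000, d_model=1024, n_layers=28)
print(f"{total / 1e9:.2f}B parameters")
```

In an MoE of the same total size, only a fraction of those per-layer weights would be active per token; the report's argument is that at this scale the routing machinery and load-balancing instability outweigh that sparse efficiency.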

Why this is interesting

A lot of current discussion focuses on frontier-scale models, but this report is a useful example of the opposite direction: small models trained from scratch for practical multilingual edge scenarios.

Some points that stand out:

  • small multilingual models can still be competitive if the pipeline is well engineered
  • post-training has a major effect on usability
  • model behavior differs significantly across Italian and English tasks
  • open pipelines can still produce meaningful results in this size class
  • small models still show clear weaknesses in arithmetic, factual recall, repetition, and domain-specific knowledge

Benchmark notes

The report includes comparisons against Qwen3-0.6B and Qwen3.5-0.8B, along with multilingual evaluations and task-by-task analysis.

A few interesting takeaways:

  • Nesso-0.4B-agentic appears especially strong and consistent on Italian tasks
  • Qwen3.5-0.8B performs better on several English generative tasks
  • Qwen3-0.6B stands out on logic / reasoning-style tasks
  • the fully open variant still achieves competitive results in several settings

Figures

LLM-as-judge comparison

https://preview.redd.it/1kw9luyvhpvg1.png?width=1935&format=png&auto=webp&s=f8781a4c64ab51d00853d84120541925d8674c54

https://preview.redd.it/q2hj6vz2ipvg1.png?width=2385&format=png&auto=webp&s=8d4484384743eacbb119896b18f91f894a8eb839

Classical benchmark

https://preview.redd.it/ri1vkdz9gpvg1.png?width=630&format=png&auto=webp&s=f889f5e16366537cc534e50e7921669d8d95fa68

Italian benchmark results

https://preview.redd.it/0ounb0negpvg1.png?width=630&format=png&auto=webp&s=df6fb43e4348795d1a0bd36e98954c6f7afa432e

English benchmark results

https://preview.redd.it/ttq58dtggpvg1.png?width=630&format=png&auto=webp&s=b2f029b6c6cf310176e11f419826b56ad97c40db

Main takeaway

This is a solid case study on what it actually looks like to train a small multilingual LLM from scratch in 2026: tokenization, storage, Slurm orchestration, distributed training, post-training, evaluation, and model release.

For anyone interested in small language models, multilingual training, edge deployment, or open LLM engineering, the report is worth a read.

submitted by /u/kazzus78