The joy and pain of training an LLM from scratch

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • mii-llm has published a technical report documenting the development process of "Zagreus" and "Nesso", families of small LLMs designed for edge deployment and multilingual (primarily European-language) use.
  • The 0.4B (~400 million) parameter models were trained from scratch, using bilingual English + target-language pretraining to cover Italian, Spanish, French, and Portuguese.
  • The released models include a base model for each language, instruct models for conversational use, agentic models for structured/agentic tasks, and "Open-Zagreus", built entirely from open data with an open recipe.
  • The documented training setup includes 64 NVIDIA A100 GPUs, ~1 trillion tokens, Hugging Face Nanotron (pretraining), Axolotl (post-training), and Slurm for multi-node orchestration.
  • The report also explains why a dense 0.4B architecture was chosen over MoE (sparsity) at the sub-1B scale, prioritizing stability and resource utilization.

mii-llm just released a detailed technical report on the development of the Zagreus and Nesso model families: a set of 0.4B parameter language models trained from scratch with a focus on edge deployment, multilingual capability, and European languages.

The report documents the full pipeline behind a family of small language models designed for Italian, Spanish, French, and Portuguese, with bilingual pretraining centered on English + target language settings.

Released models

According to the report, the release covers:

  • a base model for each supported language (Italian, Spanish, French, Portuguese)
  • instruct variants tuned for conversational use
  • agentic variants aimed at structured / agentic tasks
  • Open-Zagreus, built entirely from open data with an open training recipe

Training setup

According to the report, the project used:

  • 64 NVIDIA A100 GPUs
  • ~1 trillion tokens
  • Datatrove for tokenization
  • Hugging Face Nanotron for pretraining
  • Axolotl for post-training
  • Slurm for multi-node orchestration
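
The scale of this setup can be sanity-checked with the common ~6·N·D FLOPs rule of thumb for dense transformer training. A rough sketch — the A100 peak throughput and the MFU figure below are my assumptions, not numbers from the report:

```python
# Back-of-the-envelope training-time estimate for the reported setup:
# 0.4B dense parameters, ~1T tokens, 64 A100s.
# Uses the common ~6*N*D FLOPs rule of thumb for dense transformer training.

N = 0.4e9       # model parameters (from the report)
D = 1.0e12      # training tokens (from the report)
n_gpus = 64     # A100 GPUs (from the report)

peak_flops = 312e12  # assumed: A100 bf16 tensor-core peak, FLOPs/s
mfu = 0.40           # assumed: model FLOPs utilization

total_flops = 6 * N * D                    # ~2.4e21 FLOPs for the full run
cluster_flops = n_gpus * peak_flops * mfu  # effective cluster FLOPs/s
seconds = total_flops / cluster_flops
days = seconds / 86400

print(f"total compute: {total_flops:.2e} FLOPs")
print(f"estimated wall-clock: {days:.1f} days")
```

Under these assumptions the run works out to a few days of wall-clock time, which is at least plausible for a 0.4B model on 64 GPUs; the report itself does not state the actual duration here.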

The report also explains why a dense 0.4B architecture was selected instead of MoE, arguing that in the sub-1B regime, stability and utilization can matter more than sparse efficiency.
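For intuition on what "dense 0.4B" means architecturally, the standard parameter-count estimate for a decoder-only transformer (embeddings plus roughly 12·d² weights per layer) can be sketched. The vocabulary size, hidden size, and layer count below are illustrative guesses, not the actual Zagreus/Nesso configuration:

```python
# Rough dense decoder-only transformer parameter count:
#   embeddings: vocab * d_model           (tied input/output embeddings)
#   per layer:  ~4*d^2 (attention) + 8*d^2 (MLP, 4x expansion) = 12*d^2
# All hyperparameters below are illustrative, NOT from the report.

def dense_param_count(vocab: int, d_model: int, n_layers: int) -> int:
    embed = vocab * d_model
    per_layer = 12 * d_model ** 2
    return embed + n_layers * per_layer

# A hypothetical config landing near 0.4B total parameters:
total = dense_param_count(vocab=64_000, d_model=1024, n_layers=28)
print(f"{total / 1e9:.2f}B parameters")
```

In an MoE of the same total size, only a fraction of those per-layer weights would be active per token; the report's argument is that at this scale the routing machinery and load-balancing instability outweigh that sparse efficiency.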

Why this is interesting

A lot of current discussion focuses on frontier-scale models, but this report is a useful example of the opposite direction: small models trained from scratch for practical multilingual edge scenarios.

Some points that stand out:

  • small multilingual models can still be competitive if the pipeline is well engineered
  • post-training has a major effect on usability
  • model behavior differs significantly across Italian and English tasks
  • open pipelines can still produce meaningful results in this size class
  • small models still show clear weaknesses in arithmetic, factual recall, repetition, and domain-specific knowledge

Benchmark notes

The report includes comparisons against Qwen3-0.6B and Qwen3.5-0.8B, along with multilingual evaluations and task-by-task analysis.

A few interesting takeaways:

  • Nesso-0.4B-agentic appears especially strong and consistent on Italian tasks
  • Qwen3.5-0.8B performs better on several English generative tasks
  • Qwen3-0.6B stands out on logic / reasoning-style tasks
  • the fully open variant still achieves competitive results in several settings

Figures

LLM-as-judge comparison

https://preview.redd.it/1kw9luyvhpvg1.png?width=1935&format=png&auto=webp&s=f8781a4c64ab51d00853d84120541925d8674c54

https://preview.redd.it/q2hj6vz2ipvg1.png?width=2385&format=png&auto=webp&s=8d4484384743eacbb119896b18f91f894a8eb839

Classical benchmark

https://preview.redd.it/ri1vkdz9gpvg1.png?width=630&format=png&auto=webp&s=f889f5e16366537cc534e50e7921669d8d95fa68

Italian benchmark results

https://preview.redd.it/0ounb0negpvg1.png?width=630&format=png&auto=webp&s=df6fb43e4348795d1a0bd36e98954c6f7afa432e

English benchmark results

https://preview.redd.it/ttq58dtggpvg1.png?width=630&format=png&auto=webp&s=b2f029b6c6cf310176e11f419826b56ad97c40db

Main takeaway

This is a solid case study on what it actually looks like to train a small multilingual LLM from scratch in 2026: tokenization, storage, Slurm orchestration, distributed training, post-training, evaluation, and model release.

For anyone interested in small language models, multilingual training, edge deployment, or open LLM engineering, the report is worth a read.

submitted by /u/kazzus78