mii-llm just released a detailed technical report on the development of the Zagreus and Nesso model families: a set of 0.4B-parameter language models trained from scratch with a focus on edge deployment, multilingual capability, and European languages. The report documents the full pipeline behind a family of small language models designed for Italian, Spanish, French, and Portuguese, with bilingual pretraining centered on English + target-language settings.

Released models

The release covers base models for each target language, plus instruct models for conversational use, agentic models for structured/agentic tasks, and Open-Zagreus, a variant built entirely with open data and an open recipe.
Training setup

According to the report, the project used:

- 64 NVIDIA A100 GPUs
- roughly 1 trillion training tokens
- Hugging Face Nanotron for pretraining
- Axolotl for post-training
- Slurm for multi-node orchestration
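As a quick back-of-the-envelope check on that setup (my numbers, not the report's), the standard 6·N·D approximation gives a feel for the compute that 0.4B parameters over ~1T tokens implies on 64 A100s; the peak throughput and utilization figures below are assumptions.

```python
# Rough compute estimate for the reported setup (0.4B params, ~1T tokens,
# 64 x A100). Peak throughput and MFU are assumptions, not report figures.
N_PARAMS = 0.4e9          # model size, from the report
N_TOKENS = 1.0e12         # pretraining tokens, from the report
N_GPUS = 64               # A100 count, from the report

A100_PEAK_FLOPS = 312e12  # A100 BF16 dense peak, spec-sheet value
ASSUMED_MFU = 0.40        # assumed model FLOPs utilization

total_flops = 6 * N_PARAMS * N_TOKENS                      # 6ND approximation
cluster_flops_per_s = N_GPUS * A100_PEAK_FLOPS * ASSUMED_MFU
days = total_flops / cluster_flops_per_s / 86_400

print(f"total training compute: ~{total_flops:.1e} FLOPs")
print(f"estimated wall-clock:   ~{days:.1f} days at {ASSUMED_MFU:.0%} MFU")
```

With these assumptions the pretraining run works out to a few days of cluster time, which is consistent with the "small model, modest cluster" framing of the post.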
The report also explains why a dense 0.4B architecture was selected instead of MoE, arguing that in the sub-1B regime, stability and hardware utilization can matter more than sparse efficiency.

Why this is interesting

A lot of current discussion focuses on frontier-scale models, but this report is a useful example of the opposite direction: small models trained from scratch for practical multilingual edge scenarios, with the post highlighting several points that stand out.
Benchmark notes

The report includes comparisons against Qwen3-0.6B and Qwen3.5-0.8B, along with multilingual evaluations and a task-by-task analysis, and the post pulls out a few interesting takeaways from these.
Figures

[Figures in the original post: LLM-as-judge comparison, classical benchmarks, Italian benchmark results, English benchmark results]

Main takeaway

This is a solid case study on what it actually looks like to train a small multilingual LLM from scratch in 2026: tokenization, storage, Slurm orchestration, distributed training, post-training, evaluation, and model release. For anyone interested in small language models, multilingual training, edge deployment, or open LLM engineering, the report is worth a read.
The joy and pain of training an LLM from scratch
Reddit r/LocalLLaMA / 4/17/2026
💬 Opinion / Developer Stack & Infrastructure / Models & Research
Key Points
- mii-llm has published a technical report documenting the development of the Zagreus and Nesso families of small LLMs, built with edge deployment and multilingual (primarily European-language) use in mind.
- The roughly 0.4B (400 million) parameter language models were trained from scratch, covering Italian, Spanish, French, and Portuguese through bilingual pretraining centered on English plus the target language.
- The released models include a base model for each language, plus instruct models for conversational use, agentic models for structured/agentic tasks, and "Open-Zagreus", built from open data with an open recipe.
- The training setup lists 64 NVIDIA A100 GPUs, roughly 1 trillion tokens, Hugging Face Nanotron for pretraining, Axolotl for post-training, and multi-node operation via Slurm.
- It also explains why a dense 0.4B architecture was chosen over MoE (sparsity), prioritizing stability and resource utilization at sub-1B scale; a rough parameter-shape sketch follows below.
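On the dense-vs-MoE point in the last bullet, here is a purely illustrative sketch of what a ~0.4B dense transformer's shape could look like; the hidden size, layer count, and vocabulary are assumptions of mine, not figures from the report.

```python
# Hypothetical shape for a dense ~0.4B transformer. The specific
# d_model / n_layers / vocab_size values are illustrative assumptions,
# not the actual Zagreus/Nesso configuration.

def dense_param_count(d_model: int, n_layers: int, vocab_size: int) -> int:
    """Approximate parameter count with tied input/output embeddings."""
    embeddings = vocab_size * d_model   # token embedding matrix (tied with the LM head)
    per_layer = 12 * d_model ** 2       # ~4*d^2 attention + ~8*d^2 MLP, ignoring norms/biases
    return embeddings + n_layers * per_layer

total = dense_param_count(d_model=1024, n_layers=24, vocab_size=64_000)
print(f"~{total / 1e9:.2f}B parameters")   # ~0.37B under these assumptions
```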
