Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

Reddit r/LocalLLaMA / 4/6/2026

💬 OpinionDeveloper Stack & InfrastructureIdeas & Deep AnalysisModels & Research

Key Points

  • The article describes Dante-2B, a 2.1B parameter decoder-only bilingual Italian/English LLM trained from scratch (not adapted from Llama/Mistral) on 2×H200 GPUs, with Phase 1 completed in ~16 days and Phase 2 extending toward longer context.
  • Dante-2B’s main differentiator is a custom 64K BPE tokenizer built specifically for Italian morphology and punctuation (including atomic handling of accented characters and apostrophe contractions), aiming to reduce tokenizer overhead and improve Italian efficiency/quality.
  • The model uses a LLaMA-style architecture with GQA, SwiGLU FFN, RMSNorm, RoPE, and Flash Attention optimization for H200, with no MoE so all 2B parameters are active per token.
  • Phase 1 training used ~90B tokens at seq_len 2048 with DeepSpeed ZeRO-2 and FP8 acceleration (torchao), reporting stable runs (no NaNs/OOM) and consistent MFU, while Phase 2 is adding ~30B tokens to target 4096 context.
  • The author claims that after Phase 1 the model already produces coherent Italian with correct grammar and article usage, positioning it as a practical native Italian-capable small model rather than a large-scale reasoning system.

The problem

If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought — English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome.

I decided to fix this from the ground up.

What is Dante-2B

A 2.1B parameter, decoder-only, dense transformer. Trained from scratch — no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.

Architecture:

  • LLaMA-style with GQA (20 query heads, 4 KV heads — 5:1 ratio)
  • SwiGLU FFN, RMSNorm, RoPE
  • d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
  • Weight-tied embeddings, no MoE — all 2.1B params active per token
  • Custom 64K BPE tokenizer built specifically for Italian + English + code

Why the tokenizer matters

This is where most multilingual models silently fail. Standard English-centric tokenizers split l'intelligenza into l, ', intelligenza — 3 tokens for what any Italian speaker sees as 1.5 words. Multiply that across an entire document and you're wasting 20-30% of your context window on tokenizer overhead.

Dante's tokenizer was trained on a character-balanced mix (~42% Italian, ~36% English, ~22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact. Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units — they're always single tokens, not two bytes glued together by luck.

Small detail, massive impact on efficiency and quality for Italian text.

Training setup

Data: ~300B token corpus. Italian web text (FineWeb-2 IT), English educational content (FineWeb-Edu), Italian public domain literature (171K books), legal/parliamentary texts (Gazzetta Ufficiale, EuroParl), Wikipedia in both languages, and StarCoderData for code. Everything pre-tokenized into uint16 binary with quality tiers.

Phase 1 (just completed): 90B tokens at seq_len 2048. DeepSpeed ZeRO-2, torch.compile with reduce-overhead, FP8 via torchao. Cosine LR schedule 3e-4 → 3e-5 with 2000-step warmup. ~16 days, rock solid — no NaN events, no OOM, consistent 28% MFU.

Phase 2 (in progress): Extending to 4096 context with 30B more tokens at reduced LR. Should take ~4-7 more days.

What it can do right now

After Phase 1 the model already generates coherent Italian text — proper grammar, correct use of articles, reasonable topic continuity. It's a 2B, so don't expect GPT-4 reasoning. But for a model this size, trained natively on Italian, the fluency is already beyond what I've seen from Italian fine-tunes of English models at similar scale.

I'll share samples after Phase 2, when the model has full 4K context.

What's next

  1. Phase 2 completion (est. ~1 week)
  2. HuggingFace release of the base model — weights, tokenizer, config, full model card
  3. SFT phase for instruction following (Phase 3)
  4. Community benchmarks — I want to test against Italian fine-tunes of Llama/Gemma/Qwen at similar sizes

Why I'm posting now

I want to know what you'd actually find useful. A few questions for the community:

  • Anyone working with Italian NLP? I'd love to know what benchmarks or tasks matter most to you.
  • What eval suite would you want to see? I'm planning perplexity on held-out Italian text + standard benchmarks, but if there's a specific Italian eval set I should include, let me know.
  • Interest in the tokenizer alone? The Italian-aware 64K BPE tokenizer might be useful even independently of the model — should I release it separately?

About me

I'm a researcher and entrepreneur based in Rome. PhD in Computer Engineering, I teach AI and emerging tech at university, and I run an innovation company that brings emerging technologies to businesses. Dante-2B started as a research project to prove that you don't need a massive cluster to train a decent model from scratch — you need good data, a clean architecture, and patience.

Everything will be open-sourced. The whole pipeline — from corpus download to tokenizer training to pretraining scripts — will be on GitHub.

Happy to answer any questions. 🇮🇹

submitted by /u/angeletti89
[link] [comments]