NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus
arXiv cs.AI / 5/4/2026
Key Points
- The paper introduces NorBERTo, a ModernBERT-based encoder model for Portuguese with long-context support and efficient attention mechanisms.
- NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus containing 331B GPT-2 tokens sourced from diverse web data and existing multilingual datasets.
- Benchmarking on standardized Portuguese NLP tasks (including semantic similarity, textual entailment, and classification) shows NorBERTo-large achieving top encoder-model results on PLUE, such as 0.9191 F1 on MRPC and 0.7689 accuracy on RTE.
- On ASSIN 2, NorBERTo-large achieves the highest entailment F1 (~0.904) among encoders considered, while some earlier models (e.g., Albertina-900M and BERTimbau-large) still outperform it in parts of the evaluation.
- Aurora-PT is claimed to be the largest openly available monolingual Portuguese corpus to date, and NorBERTo is positioned as practical mid-sized infrastructure for fine-tuning and deployment, including as a backbone for retrieval-augmented generation.
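Since the paper positions NorBERTo as an encoder backbone for retrieval-augmented generation, the retrieval step would typically mean-pool its token embeddings into sentence vectors and rank documents by cosine similarity. The sketch below illustrates that pipeline in plain NumPy; the toy random tensors stand in for real NorBERTo encoder outputs (no actual checkpoint is loaded, and the shapes are illustrative assumptions, not the model's real hidden size).

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)      # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)       # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)       # avoid divide-by-zero
    return summed / counts

def cosine_rank(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Return document indices sorted from most to least similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

# Toy stand-in for encoder outputs: 3 documents, 4 tokens each, 8-dim embeddings.
rng = np.random.default_rng(0)
token_embs = rng.normal(size=(3, 4, 8))
mask = np.ones((3, 4), dtype=int)

doc_vecs = mean_pool(token_embs, mask)
query = doc_vecs[1] + 0.01 * rng.normal(size=8)  # a query nearly identical to doc 1
order = cosine_rank(query, doc_vecs)
print(order)  # doc 1 should rank first
```

In a real deployment the `token_embs` and `mask` would come from running the NorBERTo tokenizer and encoder over the texts; the pooling and ranking logic stays the same.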