NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus

arXiv cs.AI / 5/4/2026


Key Points

  • The paper introduces NorBERTo, a ModernBERT-based encoder model for Portuguese with long-context support and efficient attention mechanisms.
  • NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus containing 331B GPT-2 tokens sourced from diverse web data and existing multilingual datasets.
  • Benchmarking on standardized Portuguese NLP tasks (including semantic similarity, textual entailment, and classification) shows NorBERTo-large achieving top encoder-model results on PLUE, such as 0.9191 F1 on MRPC and 0.7689 accuracy on RTE.
  • On ASSIN 2, NorBERTo-large achieves the highest entailment F1 (~0.904) among encoders considered, while some earlier models (e.g., Albertina-900M and BERTimbau-large) still outperform it in parts of the evaluation.
  • Aurora-PT is claimed to be the largest openly available monolingual Portuguese corpus to date, and NorBERTo is positioned as practical mid-sized infrastructure for fine-tuning and deployment, including as a backbone for retrieval-augmented generation.
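The entailment F1 numbers quoted above follow the standard precision/recall definition for the positive (entailment) class. As a point of reference, here is a minimal sketch of that computation in plain Python; the labels are illustrative toy data, not drawn from ASSIN 2, and whether the paper reports binary or macro-averaged F1 is not specified here:

```python
def binary_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class (e.g. entailment)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 1 = entailment, 0 = no entailment (illustrative only)
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = binary_f1(y_true, y_pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # → P=0.75 R=0.75 F1=0.75
```

A score such as the reported ~0.904 entailment F1 therefore implies that precision and recall on the entailment class are both high and well balanced.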

Abstract

High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms. NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment, and classification tasks using standardized datasets such as ASSIN 2 and PLUE. On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest entailment F1 (~0.904) among all encoders considered, although earlier models such as Albertina-900M and BERTimbau-large still hold an advantage on other parts of the evaluation. To the best of our knowledge, Aurora-PT is currently the largest openly available monolingual Portuguese corpus, surpassing previous resources. NorBERTo provides a modern, mid-sized encoder designed for realistic deployment scenarios: it is straightforward to fine-tune, efficient to serve, and well suited as a backbone for retrieval-augmented generation and other downstream Portuguese NLP systems.