MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

arXiv cs.CL / 4/30/2026


Key Points

  • The paper introduces Math-PT, a new Portuguese (European and Brazilian) math reasoning benchmark containing 1,729 problems sourced from native Portuguese materials such as olympiads, competitions, and exams.
  • It argues that existing math-reasoning evaluations are heavily linguistically biased toward English (or English translations), limiting fairness and usefulness across languages.
  • The authors evaluate current state-of-the-art LLMs on Math-PT and find that frontier reasoning models outperform open-weight models on multiple-choice questions.
  • The study also shows a drop in performance for questions that include figures and for open-ended questions, highlighting ongoing weaknesses in multimodal and free-form reasoning.
  • To support further work, the benchmark dataset and the model outputs are released for public use.

Abstract

The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. Math-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on Math-PT, revealing that frontier reasoning models achieve strong performance on multiple-choice questions compared to open-weight models, but that their performance decreases on questions with figures and on open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.