MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

arXiv cs.CL / 4/30/2026


Key Points

  • The paper introduces Math-PT, a new Portuguese (European and Brazilian) math reasoning benchmark containing 1,729 problems sourced from native Portuguese materials such as olympiads, competitions, and exams.
  • It argues that existing math-reasoning evaluations are heavily linguistically biased toward English (or English translations), limiting fairness and usefulness across languages.
  • The authors evaluate current state-of-the-art LLMs on Math-PT and find that frontier reasoning models outperform open-weight models on multiple-choice questions.
  • The study also shows a drop in performance for questions that include figures and for open-ended questions, highlighting ongoing weaknesses in multimodal and free-form reasoning.
  • To support further work, the benchmark dataset and the model outputs are released for public use.

Abstract

The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. Math-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on Math-PT, revealing that frontier reasoning models achieve strong performance on multiple-choice questions compared to open-weight models, but that their performance decreases on questions with figures and on open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.