Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
arXiv cs.CL / 5/5/2026
Key Points
- The study finds that holistic LLM-as-a-judge rankings are highly sensitive to the bias of the selected judge model, which can distort comparisons.
- Prosa introduces a new real-user, multi-turn Brazilian Portuguese benchmark built from 1,000 WildChat conversations and evaluated with three judges from different model families across 16 candidate models.
- Scoring with a binary rubric plus multi-judge filtering sharply reduces sensitivity to the judge model: the three judges agree on the ranks of all 16 candidate models under filtered rubric scoring, versus only 7 under holistic scoring.
- The rubric filtering pipeline also improves discriminative power, widening the average score gap between neighboring models by 47%, and makes evaluation more repeatable, at an estimated cost of about $2.10 per new model using Gemini 3 Flash as the judge.
- The authors release the benchmark and filtering code, enabling future model evaluations under consistent conditions and making the rubric-based scoring method reusable for other open-ended settings.
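The filtering idea in the points above can be sketched as follows. This is a hypothetical illustration, not the authors' released code: the function names, data shapes, and toy criteria are assumptions. Each criterion gets a binary verdict from each judge; criteria where the judges disagree are filtered out, and a model's score is the mean over the surviving verdicts.

```python
# Hypothetical sketch of binary-rubric scoring with multi-judge
# filtering (names and data layout are assumptions, not the paper's
# implementation).
from statistics import mean

def filtered_rubric_score(judgments):
    """judgments maps criterion -> {judge: 0 or 1}.
    Keep only criteria where all judges agree, then average
    the unanimous verdicts into a single score."""
    kept = [
        votes
        for votes in judgments.values()
        if len(set(votes.values())) == 1  # unanimous only
    ]
    if not kept:
        return None  # no consensus criteria for this response
    return mean(next(iter(votes.values())) for votes in kept)

# Toy example: three judges, three binary criteria.
judgments = {
    "answers_in_portuguese": {"j1": 1, "j2": 1, "j3": 1},
    "addresses_follow_up":   {"j1": 1, "j2": 0, "j3": 1},  # filtered out
    "no_hallucinated_facts": {"j1": 0, "j2": 0, "j3": 0},
}
print(filtered_rubric_score(judgments))  # 0.5: one unanimous pass, one fail
```

Because disagreement-prone criteria are dropped, the surviving rubric items are exactly the ones any of the judge models would score the same way, which is what makes the final ranking insensitive to the choice of judge.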