Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
arXiv cs.CL / 5/5/2026
Key Points
- The study finds that holistic LLM-as-a-judge rankings are highly sensitive to the bias of the selected judge model, which can distort comparisons.
- Prosa introduces a new real-user, multi-turn Brazilian Portuguese benchmark built from 1,000 WildChat conversations and evaluated with three judges from different model families across 16 candidate models.
- Scoring with a binary rubric plus multi-judge filtering sharply reduces sensitivity to the judge model: the three judges agree on the ranks of all 16 candidate models under filtered rubric scoring, versus only 7 under holistic scoring.
- The rubric filtering pipeline also improves discriminative power, widening the average score gap between neighboring models by 47%, and makes evaluation more repeatable, at an estimated cost of about $2.10 per new model using Gemini 3 Flash as the judge.
- The authors release the benchmark and filtering code, enabling future model evaluations under consistent conditions and making the rubric-based scoring method reusable for other open-ended settings.
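The filtering idea in the points above can be sketched as follows. This is a hypothetical illustration, not the authors' released code: the function names, data shapes, and toy criteria are assumptions. Each criterion gets a binary verdict from each judge; criteria where the judges disagree are filtered out, and a model's score is the mean over the surviving verdicts.

```python
# Hypothetical sketch of binary-rubric scoring with multi-judge
# filtering (names and data layout are assumptions, not the paper's
# implementation).
from statistics import mean

def filtered_rubric_score(judgments):
    """judgments maps criterion -> {judge: 0 or 1}.
    Keep only criteria where all judges agree, then average
    the unanimous verdicts into a single score."""
    kept = [
        votes
        for votes in judgments.values()
        if len(set(votes.values())) == 1  # unanimous only
    ]
    if not kept:
        return None  # no consensus criteria for this response
    return mean(next(iter(votes.values())) for votes in kept)

# Toy example: three judges, three binary criteria.
judgments = {
    "answers_in_portuguese": {"j1": 1, "j2": 1, "j3": 1},
    "addresses_follow_up":   {"j1": 1, "j2": 0, "j3": 1},  # filtered out
    "no_hallucinated_facts": {"j1": 0, "j2": 0, "j3": 0},
}
print(filtered_rubric_score(judgments))  # 0.5: one unanimous pass, one fail
```

Because disagreement-prone criteria are dropped, the surviving rubric items are exactly the ones any of the judge models would score the same way, which is what makes the final ranking insensitive to the choice of judge.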