CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

arXiv cs.CL / 3/25/2026


Key Points

  • CAPITU is introduced as a Brazilian Portuguese benchmark to evaluate LLM instruction-following using prompts grounded in eight canonical works of Brazilian literature.
  • The benchmark covers 59 instruction types across seven categories, including Portuguese-specific linguistic and structural constraints that are designed to be automatically verifiable without human/LLM judging.
  • Experiments on 18 state-of-the-art models show very high strict accuracy for frontier reasoning models (e.g., GPT-5.2 with reasoning at 98.5%) and better cost-efficiency for Portuguese-specialized models (e.g., Sabiazinho-4 at 87.0% for $0.13 vs Claude-Haiku-4.5 at 73.5% for $1.12).
  • In multi-turn settings, performance varies widely by model (about 60% to 96% conversation-level accuracy), revealing challenges such as morphological constraint handling, exact counting, and degradation of constraint persistence over turns.
  • The authors release the full benchmark, evaluation code, and baseline results to support further research on instruction-following in Portuguese.
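The automatically verifiable constraints described above (e.g., requiring words with terminations like -ando/-endo/-indo) can be checked with simple string matching rather than an LLM judge. The paper's actual verifier code is not shown here; the following is a minimal sketch of one such check, with a hypothetical function name and example sentence:

```python
import re

def check_termination(text: str, suffixes: tuple[str, ...], min_count: int) -> bool:
    """Verify that at least `min_count` words in `text` end with one of `suffixes`.

    This mirrors the idea of a rule-based, automatically verifiable constraint:
    no human or LLM judging is needed, only deterministic string matching.
    """
    words = re.findall(r"\w+", text.lower())
    hits = sum(1 for w in words if w.endswith(suffixes))
    return hits >= min_count

# Example: require at least three gerunds (-ando/-endo/-indo)
resposta = "Capitu estava sorrindo, pensando e escrevendo ao mesmo tempo."
print(check_termination(resposta, ("ando", "endo", "indo"), 3))  # True
```

Because the check is deterministic, the same response always receives the same verdict, which is what makes large-scale, judge-free evaluation feasible.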

Abstract

We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13 vs. Claude-Haiku-4.5: 73.5% at $1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
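The abstract distinguishes two aggregation levels: strict accuracy (a prompt counts only if every constraint is satisfied) and conversation-level accuracy (a multi-turn conversation counts only if every turn passes). The paper's exact aggregation code is not given; a minimal sketch of these metrics, under that reading, could look like:

```python
def strict_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts where every per-constraint check passed."""
    return sum(all(prompt) for prompt in results) / len(results)

def conversation_accuracy(convs: list[list[list[bool]]]) -> float:
    """Fraction of conversations where every turn satisfies all its constraints.

    A single failed constraint in any turn fails the whole conversation,
    which is why persistence degradation across turns hurts this metric.
    """
    return sum(all(all(turn) for turn in conv) for conv in convs) / len(convs)

# Hypothetical results: per-prompt lists of per-constraint pass/fail flags
results = [[True, True], [True, False], [True]]
print(strict_accuracy(results))  # 2 of 3 prompts pass all constraints
```

The all-or-nothing conversation metric explains why per-turn accuracy can look strong while conversation-level accuracy drops as low as 60% for some models.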