CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context
arXiv cs.CL / 3/25/2026
Key Points
- CAPITU is introduced as a Brazilian Portuguese benchmark to evaluate LLM instruction-following using prompts grounded in eight canonical works of Brazilian literature.
- The benchmark covers 59 instruction types across seven categories, including Portuguese-specific linguistic and structural constraints designed to be automatically verifiable, with no human or LLM judging required.
- Experiments on 18 state-of-the-art models show very high strict accuracy for frontier reasoning models (e.g., GPT-5.2 with reasoning reaches 98.5%), while Portuguese-specialized models are more cost-efficient (e.g., Sabiazinho-4 scores 87.0% at $0.13, versus 73.5% at $1.12 for Claude-Haiku-4.5).
- In multi-turn settings, performance varies widely by model (about 60% to 96% conversation-level accuracy), revealing challenges such as morphological constraint handling, exact counting, and degradation of constraint persistence over turns.
- The authors release the full benchmark, evaluation code, and baseline results to support further research on instruction-following in Portuguese.
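Automatically verifiable constraints of the kind described above can be scored with simple rule-based checks rather than a judge model. The sketch below is a minimal illustration, not the paper's actual evaluation code: the constraint names, example responses, and the strict all-or-nothing scoring are assumptions for demonstration.

```python
def check_word_count(response: str, max_words: int) -> bool:
    """Structural constraint: at most `max_words` words (hypothetical example)."""
    return len(response.split()) <= max_words

def check_keyword_present(response: str, keyword: str) -> bool:
    """Content constraint: the keyword must appear, case-insensitively."""
    return keyword.lower() in response.lower()

def strict_accuracy(responses, checks):
    """Strict scoring: a response counts only if it satisfies *all* of its
    attached checks (no partial credit per prompt)."""
    passed = sum(all(chk(r) for chk in chks) for r, chks in zip(responses, checks))
    return passed / len(responses)

# Toy responses grounded in Dom Casmurro (illustrative only).
responses = [
    "Capitu olhava como ressaca.",             # short and mentions Capitu
    "Dom Casmurro narra a própria história.",  # does not mention Capitu
]
checks = [
    [lambda r: check_word_count(r, 10),
     lambda r: check_keyword_present(r, "Capitu")],
    [lambda r: check_keyword_present(r, "Capitu")],
]
print(strict_accuracy(responses, checks))  # → 0.5
```

This is why such benchmarks scale cheaply: each constraint reduces to a deterministic predicate over the model's output, so thousands of responses can be scored reproducibly without a second model in the loop.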