GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
arXiv cs.CL · April 29, 2026
Key Points
- The paper argues that multilingual agent benchmarks built mainly via machine translation and light post-editing can become invalid due to query–answer misalignment and culturally irrelevant context.
- It proposes a refined adaptation workflow that explicitly addresses functional equivalence, cultural context, and difficulty calibration, validated through automated checks plus human review.
- Using this workflow, the authors introduce GAIA-v2-LILT, a re-audited multilingual extension of the GAIA agent benchmark spanning five non-English languages.
- Experiments show that the workflow boosts agent success rates by up to 32.7% versus minimally translated baselines and narrows performance to within 3.1% of English in the closest audited setting.
- The work suggests that much of the multilingual performance gap is caused by benchmark measurement error, and it provides both the dataset (via MAPS on Hugging Face) and the experimental code on GitHub.
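The automated-check stage described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration; the class name, field names, and the specific heuristic (flagging adapted items whose gold answer drops a number present in the English original) are hypothetical, not the authors' implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class AdaptedItem:
    """Hypothetical record for one benchmark item after adaptation."""
    query: str      # adapted, non-English task query
    answer: str     # gold answer after adaptation
    en_answer: str  # original English gold answer


def audit(item: AdaptedItem) -> list[str]:
    """Return flags for common adaptation failures; an empty list means
    the item passes the automated stage and proceeds to human review."""
    flags = []
    if not item.query.strip() or not item.answer.strip():
        flags.append("empty field")
    # Numeric answers should survive adaptation unchanged unless the
    # query was deliberately re-grounded (units, currency, dates).
    nums_src = re.findall(r"\d+(?:\.\d+)?", item.en_answer)
    nums_tgt = re.findall(r"\d+(?:\.\d+)?", item.answer)
    if nums_src and nums_src != nums_tgt:
        flags.append("numeric mismatch vs. English gold answer")
    return flags


# Example: a translated item whose gold answer was spelled out,
# silently dropping the numeral the English answer carried.
item = AdaptedItem(query="¿Cuántos satélites tiene Neptuno?",
                   answer="catorce", en_answer="14")
print(audit(item))  # → ['numeric mismatch vs. English gold answer']
```

Checks like this only catch mechanical drift; the paper's point is that they must be paired with human review for cultural relevance and difficulty calibration.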