GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

arXiv cs.CL / April 29, 2026


Key Points

  • The paper argues that multilingual agent benchmarks built mainly via machine translation and light post-editing can become invalid due to query–answer misalignment and culturally irrelevant context.
  • It proposes a refined adaptation workflow that makes functional alignment, cultural alignment, and difficulty calibration explicit, validated through automated checks plus human review (sketched after this list).
  • Using this workflow, the authors introduce GAIA-v2-LILT, a re-audited multilingual extension of the GAIA agent benchmark spanning five non-English languages.
  • Experiments show that the workflow boosts agent success rates by up to 32.7% versus minimally translated baselines and narrows performance to within 3.1% of English in the closest audited setting.
  • The work suggests that much of the multilingual performance gap is caused by benchmark measurement error, and it provides both the dataset (via MAPS on Hugging Face) and the experimental code on GitHub.

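The paper's workflow is not published as a single script, but the automated-check stage named above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Task` fields, the check heuristics, and the 0.3 length tolerance are all invented here for clarity.

```python
# Illustrative sketch (not the authors' released code) of the three automated
# checks named above: functional alignment, cultural alignment, and difficulty
# calibration. All field names, heuristics, and thresholds are assumptions.
from dataclasses import dataclass, field


@dataclass
class Task:
    query: str                 # natural-language instruction given to the agent
    answer: str                # gold answer the agent must produce
    language: str              # e.g. "en" for the source, "ja" for an adaptation
    locale_entities: list[str] = field(default_factory=list)  # locale-bound entities in the query


def functionally_aligned(source: Task, adapted: Task) -> bool:
    """Functional alignment: the adapted query must still resolve to the same
    gold answer. A crude proxy here is that the answer survives unchanged;
    a real pipeline would re-verify the answer against the adapted query."""
    return bool(adapted.answer.strip()) and adapted.answer == source.answer


def culturally_aligned(source: Task, adapted: Task) -> bool:
    """Cultural alignment: source-locale entities (e.g. region-specific
    services or references) should not survive unlocalized in the adapted
    query; if they do, the item needs human localization, not just MT."""
    return not any(entity in adapted.query for entity in source.locale_entities)


def difficulty_calibrated(source: Task, adapted: Task, tolerance: float = 0.3) -> bool:
    """Difficulty calibration: a very rough length-based proxy. The adapted
    query should not be drastically shorter or longer than the source."""
    src_len = len(source.query.split())
    adp_len = len(adapted.query.split())
    return abs(adp_len - src_len) <= tolerance * max(src_len, 1)


def audit(source: Task, adapted: Task) -> dict[str, bool]:
    """Run all automated checks; items failing any check go to human review."""
    return {
        "functional_alignment": functionally_aligned(source, adapted),
        "cultural_alignment": culturally_aligned(source, adapted),
        "difficulty_calibration": difficulty_calibrated(source, adapted),
    }
```

In the workflow described by the paper, items that fail such automated checks would then be routed to the human-review stage rather than shipped as-is.
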
Abstract

Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages, with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance, while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, motivating task-level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package at https://huggingface.co/datasets/Fujitsu-FRE/MAPS/viewer/GAIA-v2-LILT. We also release the code used in our experiments at https://github.com/lilt/gaia-v2-lilt.
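
For readers who want to inspect the released data, the Hugging Face `datasets` library should be able to load it directly from the linked repository. The sketch below is a best guess from the viewer URL above: the repository id and configuration name are taken from that URL, while the split name and column layout are assumptions to verify against the dataset card.

```python
# Minimal sketch, assuming the MAPS repository exposes a "GAIA-v2-LILT" config
# as the viewer URL suggests; the "test" split name is an assumption.
from datasets import load_dataset

ds = load_dataset("Fujitsu-FRE/MAPS", name="GAIA-v2-LILT", split="test")
print(ds)      # column names and number of examples
print(ds[0])   # one adapted task instance
```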