The Amazing Agent Race: Strong Tool Users, Weak Navigators

arXiv cs.AI / 4/14/2026



Key Points

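Existing tool-use benchmarks for LLM agents are mostly linear (chains of 2 to 5 steps), so, according to the authors' analysis, they tend to overlook a real weakness of agents: navigation.
The Amazing Agent Race (AAR) is a benchmark built from DAG-structured "legs" (fork-merge tool chains) that require Wikipedia navigation, multi-step tool execution, and aggregation of results into a verifiable answer.
It provides 1,400 procedurally generated instances (800 sequential, 600 compositional) across four difficulty levels, with live-API validation and three diagnostic metrics (finish-line accuracy, pit-stop visit rate, roadblock completion rate) that separate navigation, tool-use, and arithmetic failures.
Across three agent frameworks, the best reaches only 37.2% accuracy. Navigation errors are the main cause of failure (27 to 52% of trials), while tool-use errors stay below 17%.
AAR's compositional structure exposes a blind spot: agents fail not at calling tools but at reaching the right pages, which suggests that benchmark design strongly shapes how results are interpreted.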
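To make the three diagnostic metrics more concrete, here is a minimal Python sketch of how one trial might be scored and its failure attributed. This summary does not give the metrics' exact definitions, so everything here is an assumption for illustration: the `Trial` fields, the set-overlap formulas, and the rule that an incomplete pit-stop visit counts as a navigation failure before tool use is considered. It is not the benchmark's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One agent run on one leg. All field names are hypothetical."""
    required_pages: set[str]        # pit stops the leg expects the agent to visit
    visited_pages: set[str]         # pages the agent actually opened
    required_tool_steps: set[str]   # roadblocks (tool steps) defined by the leg
    completed_tool_steps: set[str]  # tool steps the agent completed correctly
    expected_answer: str
    agent_answer: str


def _coverage(required: set[str], achieved: set[str]) -> float:
    """Fraction of required items that were achieved (1.0 if nothing is required)."""
    if not required:
        return 1.0
    return len(required & achieved) / len(required)


def pit_stop_visit_rate(t: Trial) -> float:
    # Assumed definition: share of required pages that were visited.
    return _coverage(t.required_pages, t.visited_pages)


def roadblock_completion_rate(t: Trial) -> float:
    # Assumed definition: share of required tool steps that were completed.
    return _coverage(t.required_tool_steps, t.completed_tool_steps)


def finish_line_correct(t: Trial) -> bool:
    # Assumed definition: exact match on the final answer after trimming whitespace.
    return t.agent_answer.strip() == t.expected_answer.strip()


def diagnose(t: Trial) -> str:
    """Attribute a trial to the earliest stage that fell short (assumed ordering)."""
    if finish_line_correct(t):
        return "success"
    if pit_stop_visit_rate(t) < 1.0:
        return "navigation"
    if roadblock_completion_rate(t) < 1.0:
        return "tool_use"
    # All pages and tool steps were covered, yet the answer is wrong.
    return "arithmetic"
```

Under these assumptions, a trial that misses a required page is labeled a navigation failure even if its tool calls were correct, which mirrors the separation of failure causes described in the key points.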

Abstract

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
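The distinction the abstract draws between linear chains and fork-merge DAG legs can be illustrated with a short Python sketch. The edge-list representation, the function names, and the step names in the example (`page`, `population`, `area`, `density`) are hypothetical and are not taken from the benchmark. The sketch assumes each leg is given as directed edges between steps.

```python
from collections import defaultdict
from graphlib import TopologicalSorter

Edge = tuple[str, str]  # (prerequisite step, dependent step)


def classify_leg(edges: list[Edge]) -> str:
    """Return 'linear' for a simple chain, 'fork-merge' if any step fans out or in."""
    out_deg: dict[str, int] = defaultdict(int)
    in_deg: dict[str, int] = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(out_deg) | set(in_deg)
    has_fork = any(out_deg[n] > 1 for n in nodes)   # one step feeds several
    has_merge = any(in_deg[n] > 1 for n in nodes)   # several steps feed one
    return "fork-merge" if (has_fork or has_merge) else "linear"


def execution_order(edges: list[Edge]) -> list[str]:
    """Return one valid step order; raises graphlib.CycleError if the graph has a cycle."""
    predecessors: dict[str, set[str]] = defaultdict(set)
    for src, dst in edges:
        predecessors[dst].add(src)
    return list(TopologicalSorter(predecessors).static_order())


# A linear chain of three steps.
chain = [("search", "fetch"), ("fetch", "answer")]

# A diamond: one page feeds two tool calls whose results are merged.
diamond = [
    ("page", "population"),
    ("page", "area"),
    ("population", "density"),
    ("area", "density"),
]

print(classify_leg(chain))    # linear
print(classify_leg(diamond))  # fork-merge
```

In the diamond example, `execution_order(diamond)` starts with `page` and ends with `density`; the two middle steps may appear in either order, which is the kind of branching a purely linear benchmark never exercises.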
