Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

arXiv cs.LG / 4/29/2026


Key Points

  • The paper introduces Odysseys, a new benchmark with 200 realistic, long-horizon, multi-site web tasks drawn from real browsing sessions and evaluated on the live Internet.
  • It argues existing web-agent benchmarks overemphasize short, single-site tasks and that binary pass/fail scoring is insufficient for long-horizon evaluation.
  • Odysseys uses rubric-based scoring, annotating each task with an average of 6.1 graded rubrics, which agrees more closely with human judgments and gives a finer-grained signal than trajectory-level LLM-as-a-judge metrics (see the sketch after this list).
  • Experiments with leading frontier models show a top success rate of 44.5%, leaving substantial room for improvement; the paper also evaluates efficiency with a Trajectory Efficiency metric.
  • Even the strongest agents achieve only 1.15% efficiency (rubric score per step, sketched after the abstract), highlighting that long-horizon agents must succeed efficiently rather than merely eventually.
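
To make the scoring concrete, here is a minimal sketch of rubric-based task scoring, assuming each rubric is judged on a 0-1 scale and a task's score is the mean of its rubric grades. The `Rubric` class and `score_task` function are illustrative names, not taken from the paper's released code.

```python
# Sketch of rubric-based task scoring (illustrative, not the paper's code).
# Assumption: each rubric is graded on a 0-1 scale and the task score is
# the mean of its per-rubric grades.
from dataclasses import dataclass

@dataclass
class Rubric:
    criterion: str   # e.g. "Agent compared prices on at least two sites"
    grade: float     # judged score in [0.0, 1.0]

def score_task(rubrics: list[Rubric]) -> float:
    """Return the task score as the mean of its per-rubric grades."""
    if not rubrics:
        raise ValueError("a task must have at least one rubric")
    return sum(r.grade for r in rubrics) / len(rubrics)

if __name__ == "__main__":
    rubrics = [
        Rubric("Found product on site A", 1.0),
        Rubric("Found product on site B", 1.0),
        Rubric("Reported the cheaper option", 0.5),
    ]
    print(f"task score: {score_task(rubrics):.2%}")  # task score: 83.33%
```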

Abstract

Existing web agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. Real-world web use, however, consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We test several leading frontier models and find that the strongest achieve a success rate of 44.5%, leaving substantial room for future improvement. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, highlighting a clear need for agents that succeed efficiently, not simply eventually. Odysseys isolates the evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark for measuring progress toward computer-use agents that could productively operate for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev.
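
The Trajectory Efficiency metric admits an equally simple sketch, assuming the rubric score is expressed as a fraction in [0, 1] and trajectory length counts agent actions. The function name and example numbers below are illustrative, chosen only so the output lands at the paper's reported 1.15% figure; the actual score and step decomposition are not given in the abstract.

```python
# Sketch of the Trajectory Efficiency idea: rubric score earned per step.
# Assumptions: rubric_score is a fraction in [0, 1]; num_steps is the
# number of agent actions in the trajectory. Illustrative only.
def trajectory_efficiency(rubric_score: float, num_steps: int) -> float:
    """Return the rubric score earned per step of the trajectory."""
    if num_steps <= 0:
        raise ValueError("trajectory must contain at least one step")
    return rubric_score / num_steps

# Hypothetical decomposition: a 0.46 rubric score over a 40-step
# trajectory gives 0.46 / 40 = 1.15% efficiency per step.
print(f"{trajectory_efficiency(0.46, 40):.2%}")  # 1.15%
```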