Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
arXiv cs.LG / 4/29/2026
Key Points
- The paper introduces Odysseys, a new benchmark with 200 realistic, long-horizon, multi-site web tasks drawn from real browsing sessions and evaluated on the live Internet.
- It argues existing web-agent benchmarks overemphasize short, single-site tasks and that binary pass/fail scoring is insufficient for long-horizon evaluation.
- Odysseys uses rubric-based scoring (averaging 6.1 graded rubrics per task) to better match human judgments and provide a more fine-grained alternative to trajectory-level LLM-as-a-judge metrics.
- Experiments with leading frontier models show a best success rate of 44.5%, indicating significant room for improvement; the benchmark also evaluates efficiency using a Trajectory Efficiency metric.
- Even the strongest agents achieve only 1.15% efficiency (rubric score earned per step), underscoring that long-horizon agents must not only succeed eventually but succeed efficiently.
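The two metrics above can be illustrated with a short sketch. Note this is an assumed formulation, not the paper's reference implementation: it takes "rubric score" to be the average of per-rubric grades for a task, and "Trajectory Efficiency" to be that score divided by the number of steps the agent took. The `Trajectory` class and the example numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One agent run: per-rubric grades in [0, 1] and the step count."""
    rubric_scores: list[float]
    steps: int

def rubric_score(traj: Trajectory) -> float:
    # Average of graded rubrics for one task (assumed aggregation).
    return sum(traj.rubric_scores) / len(traj.rubric_scores)

def trajectory_efficiency(traj: Trajectory) -> float:
    # Rubric score earned per step (assumed definition of the metric).
    return rubric_score(traj) / traj.steps

# Illustrative run: 4 of 6 rubrics fully satisfied over 58 steps.
traj = Trajectory(rubric_scores=[1, 1, 1, 1, 0, 0], steps=58)
print(f"rubric score: {rubric_score(traj):.3f}")
print(f"efficiency:   {trajectory_efficiency(traj):.4f}")
```

Under this formulation, a partially successful but meandering trajectory scores well on rubrics yet poorly on efficiency, which is exactly the gap the benchmark's per-step metric is designed to expose.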