WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
arXiv cs.CL / 5/1/2026
Key Points
- The paper introduces WebMall, an offline benchmark designed to evaluate LLM-based web agents on complex comparison-shopping tasks across multiple simulated e-shops.
- Unlike earlier benchmarks that either use the live web (which makes results hard to reproduce) or focus on a single shop with relatively simple e-commerce data, WebMall models four shops with heterogeneous product information.
- WebMall tasks cover a spectrum from targeted product search and price comparison to finding complementary/substitute items and completing checkout flows.
- The benchmark is validated with eight different agents that vary in observation space, short-term memory support, and underlying LLM, revealing that the hardest categories have task completion rates under 65% even for the best agents.
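
To make the per-category completion rates mentioned above concrete, here is a minimal sketch of how such a metric could be computed from task outcomes. The function name, the category labels, and the sample outcomes are all hypothetical and chosen for illustration; they are not taken from the paper or its evaluation code.

```python
from collections import defaultdict


def completion_rate_by_category(results):
    """Return {category: fraction of tasks completed}.

    `results` is an iterable of (category, completed) pairs, where
    `completed` is True if the agent solved the task. This input format
    is an assumption, not WebMall's actual output format.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for category, completed in results:
        totals[category] += 1
        if completed:
            solved[category] += 1
    return {category: solved[category] / totals[category] for category in totals}


# Made-up outcomes for illustration only (not results from the paper).
example_results = [
    ("price_comparison", True),
    ("price_comparison", True),
    ("checkout", True),
    ("checkout", False),
]
rates = completion_rate_by_category(example_results)
```

Reporting the rate per category rather than one overall average is what lets a benchmark show that some task types remain much harder than others.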