WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

arXiv cs.CL / 5/1/2026


Key Points

  • The paper introduces WebMall, an offline benchmark designed to evaluate LLM-based web agents on complex comparison-shopping tasks across multiple simulated e-shops.
  • Unlike earlier benchmarks that either rely on the live web (which is hard to reproduce exactly) or focus on a single shop with relatively simple e-commerce data, WebMall models four shops with heterogeneous product information.
  • WebMall tasks cover a spectrum from targeted product search and price comparison to finding complementary/substitute items and completing checkout flows.
  • The benchmark is validated with eight different agents that vary in observation space, short-term memory support, and underlying LLM, revealing that the hardest categories have task completion rates under 65% even for the best agents.
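
To give an intuition for this kind of evaluation, below is a minimal Python sketch of how a multi-shop shopping benchmark could be scored per task category. All names (Task, run_benchmark, the exact-match completion criterion, the example offer IDs) are illustrative assumptions for this article and are not the actual WebMall harness or API.

```python
# Hypothetical sketch of scoring a multi-shop shopping benchmark.
# Names and the completion criterion are illustrative, not the WebMall release.
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Task:
    task_id: str
    category: str           # e.g. "cheapest_product_search", "vague_product_search"
    instruction: str         # natural-language goal given to the agent
    gold_answers: Set[str]   # expected offer identifiers across the simulated shops


def exact_set_match(predicted: Set[str], gold: Set[str]) -> bool:
    """Strict completion criterion: the agent must return exactly the gold set."""
    return predicted == gold


def run_benchmark(tasks: List[Task],
                  agent: Callable[[str], Set[str]],
                  scorer: Callable[[Set[str], Set[str]], bool] = exact_set_match
                  ) -> Dict[str, float]:
    """Run an agent over all tasks and report the completion rate per category."""
    per_category: Dict[str, List[bool]] = {}
    for task in tasks:
        predicted = agent(task.instruction)
        per_category.setdefault(task.category, []).append(
            scorer(predicted, task.gold_answers))
    return {cat: sum(done) / len(done) for cat, done in per_category.items()}


if __name__ == "__main__":
    # Toy example with one hand-written task; real tasks would be loaded
    # from the benchmark's task files.
    tasks = [Task("t1", "cheapest_product_search",
                  "Find the cheapest offer for product X across all shops",
                  gold_answers={"shop2/offer/123"})]
    naive_agent = lambda instruction: {"shop2/offer/123"}
    print(run_benchmark(tasks, naive_agent))
```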

Abstract

LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, the latter allowing for the exact reproduction of the experimental setup. While DeepShop and ShoppingComp provide online benchmarks that require agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, and Mind2Web cover only comparatively simple e-commerce tasks performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex retrieval tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced searches for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with the best-performing agents achieving task completion rates below 65% in the task categories "cheapest product search" and "vague product search".