WebHarbor - We "dock" the real websites into local for web agents! [R]

Reddit r/MachineLearning / 5/14/2026


Key Points

  • WebHarbor packs 15 popular websites into self-contained Flask + SQLite apps inside a single Docker image, enabling reproducible “local mirrors” for evolving web GUI agent environments.
  • It includes a control plane that can reset each mirrored site back to a byte-identical state in under one second, addressing issues like content drift and network flakiness in live-web benchmarking.
  • The project uses human-in-the-loop coding agents (e.g., Claude Code/CodeX) to automate packaging and site mirroring, with human verification before merging via open PRs.
  • WebHarbor claims out-of-the-box support for all 643 WebVoyager tasks and is seeking contributions to expand from 15 to 100+ sites, including covering Online-Mind2Web’s 147 sites.
  • The release also includes skills/tools for contributors to create new mirrors quickly (typically within a day), along with links to a Hugging Face dataset and a GitHub repository.

Hello! Excited to share our latest community-driven research project: WebHarbor: Docking Real Websites for Evolving GUI Agent Environments!

TL;DR: 15 popular websites (Amazon, GitHub, BBC News, arXiv, Booking, Hugging Face, etc.) packaged as self-contained Flask + SQLite apps in a single Docker image, with a control plane that resets each site to a byte-identical state in <1 second. All mirrors are built by human-in-the-loop coding agents (e.g., Claude Code or CodeX). We support all 643 WebVoyager tasks out of the box.
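
To make the "byte-identical reset" idea concrete, here is a minimal sketch of one way such a reset could work for a SQLite-backed site: keep a pristine snapshot file, overwrite the live database with it, and compare SHA-256 hashes to confirm the bytes match. This is an illustration under stated assumptions, not WebHarbor's actual control plane; the file names, the `cart` table, and the functions `build_snapshot`, `reset_site`, and `demo` are all hypothetical.

```python
import hashlib
import shutil
import sqlite3
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's raw bytes so two states can be compared exactly."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_snapshot(path: Path) -> None:
    """Create a tiny pristine database standing in for one mirrored site (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE cart (item TEXT NOT NULL, qty INTEGER NOT NULL)")
    conn.execute("INSERT INTO cart VALUES ('usb-cable', 1)")
    conn.commit()
    conn.close()


def reset_site(snapshot: Path, live: Path) -> None:
    """Overwrite the live database file with the pristine snapshot."""
    shutil.copyfile(snapshot, live)


def demo() -> tuple[str, str, str]:
    """Return hashes of the pristine, mutated, and restored database files."""
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "snapshot.db"
        live = Path(tmp) / "live.db"
        build_snapshot(snapshot)
        reset_site(snapshot, live)
        pristine = sha256_of(snapshot)

        # Simulate an agent episode that mutates site state.
        conn = sqlite3.connect(live)
        conn.execute("INSERT INTO cart VALUES ('keyboard', 2)")
        conn.commit()
        conn.close()
        dirty = sha256_of(live)

        # Reset and hash again: the live file should match the snapshot.
        reset_site(snapshot, live)
        restored = sha256_of(live)
        return pristine, dirty, restored


if __name__ == "__main__":
    pristine, dirty, restored = demo()
    print("mutated differs from snapshot:", dirty != pristine)
    print("restored matches snapshot:", restored == pristine)
```

Copying a small SQLite file is cheap, which is one plausible reason a file-level reset can stay well under a second; a real setup would also need to make sure the app holds no open connections to the file while it is replaced.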

Call for contribution: Our next goal is 100+ popular websites, covering all of Online-Mind2Web (147 sites) and beyond. Two tracks:

  • Contribute a new mirror site (use the coding-agent pipeline → human verify → open PR) → co-author on the final paper
  • Review submitted PRs (5 reviews → co-author)

We also released useful skills for you (or your coding agent) to work with! Typically you can create a new mirror within 1 day. See more contribution details in the Contribution Guide.

Why WebHarbor: running web agent benchmarks on the live web is a nightmare (reCAPTCHA, geo-blocks, content drift, network flakiness, and tasks that go stale within months). Plus, you can't reset the live web, which rules out heavy RL training. What you need is a lightweight, easy-to-reset, task-driven, evolving environment for web agents, for both evaluation and training!
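
To illustrate why resettability matters for RL-style training, here is a small sketch of a rollout loop in which every episode starts from the same state. It is a toy stand-in, not WebHarbor's API: the `MirrorEnv` class, the `cart` table, the random policy, and the state-based `task_success` check are all assumptions made for the example (real benchmarks may judge success differently).

```python
import random
import sqlite3

# Seed data every episode should start from (hypothetical).
SEED_ROWS = [("usb-cable", 1)]


class MirrorEnv:
    """Hypothetical stand-in for one locally mirrored site backed by SQLite."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.reset()

    def reset(self) -> None:
        """Drop all episode state and restore the seed data."""
        self.conn.execute("DROP TABLE IF EXISTS cart")
        self.conn.execute("CREATE TABLE cart (item TEXT, qty INTEGER)")
        self.conn.executemany("INSERT INTO cart VALUES (?, ?)", SEED_ROWS)
        self.conn.commit()

    def step(self, item: str, qty: int) -> None:
        """One agent action: add an item to the cart."""
        self.conn.execute("INSERT INTO cart VALUES (?, ?)", (item, qty))
        self.conn.commit()

    def state(self) -> list[tuple[str, int]]:
        """Return the cart contents in insertion order."""
        return self.conn.execute(
            "SELECT item, qty FROM cart ORDER BY rowid"
        ).fetchall()


def task_success(env: MirrorEnv) -> bool:
    """Hypothetical task: 'put exactly two keyboards in the cart'."""
    return ("keyboard", 2) in env.state()


def run_rollouts(n_episodes: int, seed: int = 0) -> list[bool]:
    """Run episodes with a random policy, resetting before each one."""
    rng = random.Random(seed)
    env = MirrorEnv()
    results = []
    for _ in range(n_episodes):
        env.reset()  # every episode starts from the same state
        assert env.state() == SEED_ROWS
        # A random policy stands in for the web agent.
        env.step(rng.choice(["keyboard", "mouse"]), rng.choice([1, 2]))
        results.append(task_success(env))
    return results


if __name__ == "__main__":
    outcomes = run_rollouts(20)
    print(f"{sum(outcomes)} of {len(outcomes)} episodes succeeded")
```

On the live web the `reset()` call has no equivalent, so side effects from one episode (items left in a cart, posted comments) would leak into the next and make rewards incomparable across rollouts.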

Related Resources:

Name | Link
🏠 WebHarbor Project Page | WebHarbor
🤗 HuggingFace Dataset | ChilleD/WebHarbor
💻 WebHarbor GitHub | Code Repo
📊 Contribution Guide | Guide Details
📝 Contribution Request Form | Google Form

Suggestions and discussion are welcome!

submitted by /u/ArtichokeHelpful7462