InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

arXiv cs.AI / 5/1/2026

📰 News · Models & Research

Key Points

  • The paper introduces InteractWeb-Bench, a multimodal interactive benchmark for website generation that specifically tests agents under non-expert, low-code user conditions rather than idealized inputs.
  • It identifies a real-world failure mode termed “blind execution,” in which a semantic mismatch between ambiguous, low-quality user instructions and the model's understanding leads agents to proceed on unverified assumptions instead of seeking clarification.
  • InteractWeb-Bench uses four types of user agents and persona-driven instruction perturbations (including ambiguity, redundancy, and contradiction) based on requirement-engineering defect taxonomies.
  • An interactive execution environment gives agents a unified action space (Clarify, Implement, Verify, Submit), supporting iterative intent refinement, code synthesis, and visual-feedback validation; a minimal sketch of this loop follows the list.
  • Experiments show that frontier multimodal LLM-based agents still struggle with blind execution, indicating limitations in intent recognition and adaptive interaction.
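
The action space in the last two points lends itself to a simple control loop. Below is a minimal sketch of one such episode, assuming hypothetical `agent`, `user`, and `env` objects; all names and interfaces here are illustrative and are not the benchmark's actual API.

```python
from enum import Enum

class Action(Enum):
    CLARIFY = "clarify"      # ask the simulated user a question about intent
    IMPLEMENT = "implement"  # synthesize or edit website code
    VERIFY = "verify"        # render the site and inspect visual feedback
    SUBMIT = "submit"        # finalize the current build for evaluation

def run_episode(agent, user, env, max_turns=20):
    """Drive one interactive episode until the agent submits or runs out of turns."""
    observation = env.reset(user.initial_instruction())
    for _ in range(max_turns):
        action, payload = agent.act(observation)
        if action is Action.CLARIFY:
            # Intent refinement: the user agent answers in persona.
            observation = user.respond(payload)
        elif action is Action.IMPLEMENT:
            observation = env.apply_code(payload)
        elif action is Action.VERIFY:
            # Visual feedback, e.g. a rendered screenshot of the current site.
            observation = env.render()
        elif action is Action.SUBMIT:
            return env.evaluate(payload)
    return env.evaluate(None)  # no submission within the turn budget
```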

Abstract

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, in particular well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert, low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations, grounded in requirement-engineering defect taxonomies, to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, which enables iterative intent refinement, code synthesis, and visual-feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
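
As a rough illustration of the perturbation step the abstract describes, the toy function below rewrites a single instruction according to one of the three defect types. It is a hard-coded, template-based stand-in: the benchmark generates perturbations with persona-driven user agents, so nothing here reflects its actual implementation.

```python
import random

# Toy, template-based perturbations for the three defect types named in the
# paper (ambiguity, redundancy, contradiction). Purely illustrative.
def perturb(instruction: str, defect: str) -> str:
    if defect == "ambiguity":
        # Replace a concrete requirement with vague language.
        return instruction.replace("a blue navigation bar", "a nice header")
    if defect == "redundancy":
        # Restate something the instruction already specifies.
        return instruction + " Also, be sure to include a navigation bar."
    if defect == "contradiction":
        # Append a requirement that conflicts with an earlier one.
        return instruction + " Actually, the navigation bar should be red."
    return instruction

rng = random.Random(0)
spec = "Build a landing page with a blue navigation bar."
print(perturb(spec, rng.choice(["ambiguity", "redundancy", "contradiction"])))
```

A perturbed instruction like the contradiction case above is exactly the kind of input that should trigger a Clarify action rather than blind implementation, which is what the benchmark's evaluation probes.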