InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
arXiv cs.AI / 5/1/2026
Key Points
- The paper introduces InteractWeb-Bench, a multimodal interactive benchmark for website generation that specifically tests agents under non-expert, low-code user conditions rather than idealized inputs.
- It identifies a real-world failure mode called "blind execution," in which an agent proceeds with implementation despite a semantic mismatch between ambiguous, low-quality user instructions and its own understanding of the intent.
- InteractWeb-Bench uses four types of user agents and persona-driven instruction perturbations (including ambiguity, redundancy, and contradiction) based on requirement-engineering defect taxonomies.
- An interactive execution environment is built with a unified action space (Clarify, Implement, Verify, Submit) to support iterative intent refinement, code synthesis, and visual feedback validation.
- Experiments show that frontier multimodal LLM-based agents remain prone to blind execution, indicating limitations in intent recognition and adaptive interaction.
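The unified action space above (Clarify, Implement, Verify, Submit) can be sketched as a simple interaction loop. This is a minimal illustrative sketch, not the paper's actual harness: the class and method names (`run_episode`, `ScriptedUser`, `ScriptedAgent`, etc.) are assumptions invented for this example.

```python
from enum import Enum, auto

class Action(Enum):
    CLARIFY = auto()    # ask the simulated user a follow-up question
    IMPLEMENT = auto()  # write or revise the website code
    VERIFY = auto()     # check the result against the stated intent
    SUBMIT = auto()     # finalize and end the episode

def run_episode(agent, user, max_steps=10):
    """Drive one interaction episode over the unified action space."""
    state = {"instruction": user.initial_instruction(), "code": None}
    for _ in range(max_steps):
        action, payload = agent.decide(state)
        if action is Action.CLARIFY:
            # Resolve ambiguity before implementing; skipping this step
            # is the "blind execution" failure mode described above.
            state["instruction"] += " | " + user.answer(payload)
        elif action is Action.IMPLEMENT:
            state["code"] = payload
        elif action is Action.VERIFY:
            state["feedback"] = user.check(state["code"])
        elif action is Action.SUBMIT:
            break
    return state["code"]

# Minimal scripted stand-ins (hypothetical, for illustration only).
class ScriptedUser:
    def initial_instruction(self): return "make it pop"
    def answer(self, question): return "use a dark navbar"
    def check(self, code): return "ok" if "navbar" in code else "mismatch"

class ScriptedAgent:
    def __init__(self): self.step = 0
    def decide(self, state):
        self.step += 1
        if self.step == 1:
            return Action.CLARIFY, "What should 'pop' mean concretely?"
        if self.step == 2:
            return Action.IMPLEMENT, "<nav class='navbar dark'></nav>"
        if self.step == 3:
            return Action.VERIFY, None
        return Action.SUBMIT, None
```

A clarify-first policy like this is what the benchmark rewards: the agent refines the vague instruction ("make it pop") into something actionable before synthesizing code and verifying the result.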