MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

arXiv cs.CV / 4/17/2026


Key Points

  • The paper introduces MM-WebAgent, a hierarchical multimodal web agent designed to generate webpages by coordinating AIGC tools rather than generating elements independently.
  • It uses hierarchical planning and iterative self-reflection to jointly optimize global page layout, local multimodal content, and their integration, aiming for consistent style and overall coherence.
  • The authors provide a benchmark specifically for multimodal webpage generation along with a multi-level evaluation protocol to assess results more systematically.
  • Experiments report that MM-WebAgent outperforms both code-generation and agent-based baselines, particularly on generating multimodal elements and integrating them into the page.

Abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
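
The plan-then-reflect loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`PagePlan`, `Element`, `plan_page`, `generate_element`, `reflect`, `build_page`) is hypothetical, the AIGC tool calls are replaced by stubs, and "style consistency" is reduced to comparing a single style tag. It only shows the control flow of a global plan, local element generation, and a reflection pass that regenerates elements that drift from the page-wide style.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str          # e.g. "image", "video", "chart"
    prompt: str        # local generation prompt
    style: str = ""    # style tag the asset was actually generated with


@dataclass
class PagePlan:
    style: str  # page-wide style chosen by the (hypothetical) global planner
    sections: dict[str, list[Element]] = field(default_factory=dict)


def plan_page(request: str) -> PagePlan:
    """Global planning stub: fix a page-wide style and a section layout."""
    plan = PagePlan(style="minimal-dark")
    plan.sections["hero"] = [Element("image", f"hero banner for {request}")]
    plan.sections["stats"] = [Element("chart", f"key metrics for {request}")]
    return plan


def generate_element(element: Element, style: str) -> Element:
    """Stand-in for an AIGC tool call; it only records the style used."""
    element.style = style
    return element


def reflect(plan: PagePlan) -> list[Element]:
    """Reflection stub: list elements whose style differs from the page style."""
    return [
        e
        for elements in plan.sections.values()
        for e in elements
        if e.style != plan.style
    ]


def build_page(request: str, max_rounds: int = 3) -> PagePlan:
    plan = plan_page(request)
    # First pass simulates isolated generation: the chart ignores the page style.
    for elements in plan.sections.values():
        for e in elements:
            generate_element(e, "default" if e.kind == "chart" else plan.style)
    # Iterative refinement: regenerate only the elements flagged by reflection.
    for _ in range(max_rounds):
        issues = reflect(plan)
        if not issues:
            break
        for e in issues:
            generate_element(e, plan.style)
    return plan
```

Calling `build_page("a coffee shop")` returns a plan in which `reflect` finds no remaining mismatches. A real system would replace the style-tag comparison with a learned or model-based judgment of the rendered page.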