Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

arXiv cs.AI / 3/17/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The authors propose a hierarchical planning framework for LLM-based web agents that separates analysis into high-level planning, low-level execution, and replanning to diagnose failures.
They show that using structured Planning Domain Definition Language (PDDL) plans yields more concise and goal-directed strategies than natural-language plans.
The study finds that low-level execution is the dominant bottleneck, highlighting the need to improve perceptual grounding and adaptive control in addition to high-level reasoning.
The framework provides a principled foundation for diagnosing and advancing LLM web agents, guiding future research on where to focus improvements.

Abstract

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

The massive shift toward edge computing and local processing

Dev.to

Self-Refining Agents in Spec-Driven Development

Dev.to

Week 3: Why I'm Learning 'Boring' ML Before Building with LLMs

Dev.to

The Three-Agent Protocol Is Transferable. The Discipline Isn't.

Dev.to

has anyone tried this? Flash-MoE: Running a 397B Parameter Model on a Laptop

Reddit r/LocalLLaMA

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Key Points

Abstract

Related Articles

The massive shift toward edge computing and local processing

Self-Refining Agents in Spec-Driven Development

Week 3: Why I'm Learning 'Boring' ML Before Building with LLMs

The Three-Agent Protocol Is Transferable. The Discipline Isn't.

has anyone tried this? Flash-MoE: Running a 397B Parameter Model on a Laptop

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer