AI Navigate

Local Qwen 8B + 4B completes browser automation by replanning one step at a time

Reddit r/LocalLLaMA / 3/17/2026


Key Points

  • Local Qwen 8B planner + 4B executor achieve browser automation by replanning after each DOM snapshot instead of relying on an upfront, full-task plan, improving reliability on unfamiliar pages.
  • The approach uses a compact, semantic DOM representation (id, role, text, etc.) so the model never sees raw HTML or screenshots, reducing token requirements significantly.
  • In an Ace Hardware demo, the full cart flow was completed using a 4B executor with zero vision, totaling about 15K tokens versus 50-100K+ for vision-based approaches.
  • Modal handling improvements—scanning for and dismissing overlays after each click—greatly reduced failures caused by hidden UI elements.
  • The results suggest stepwise planning could generalize to other unfamiliar sites, with a flow recording attached for an Amazon shopping demo as additional evidence.

Small local LLMs got much better at browser automation once I stopped asking them to plan the whole task upfront.

What failed repeatedly was this:

model sees goal → invents full multi-step plan before seeing real page state

That works on familiar sites, but breaks fast on anything unexpected.

What worked better was stepwise planning:

Step 1: see search box → TYPE "grass mower"
Step 2: see results → CLICK Add to Cart
Step 3: drawer appears → dismiss it
Step 4: cart visible → CLICK View Cart
Step 5: DONE

Each step replans from the current DOM snapshot instead of assuming what should exist next.
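The loop above can be sketched roughly like this. A minimal sketch, assuming hypothetical `snapshot`/`plan_next_step`/`execute` hooks (the post doesn't show the actual code); the point is that the planner only ever sees the goal plus the current snapshot, never a stale multi-step plan:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "CLICK", "TYPE", or "DONE" (assumed action set)
    element_id: int = 0
    text: str = ""

def run_task(goal, snapshot, plan_next_step, execute, max_steps=20):
    """Replan from a fresh DOM snapshot before every single action."""
    for _ in range(max_steps):
        dom = snapshot()                    # current page state, not an assumption
        action = plan_next_step(goal, dom)  # planner sees only goal + current DOM
        if action.kind == "DONE":
            return True
        execute(action)                     # executor acts on one element id
    return False                            # step budget exhausted
```

The `max_steps` cap matters in practice: a replanning agent with no budget can loop forever on a page it misreads.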

The other thing that made this work: compact DOM representation. The model never sees raw HTML or screenshots—just a semantic table:

id|role|text|importance|bg|clickable|nearby_text
665|button|Proceed to checkout|675|orange|1|
761|button|Add to cart|720|yellow|1|$299.99
1488|link|ThinkPad E16|478|none|1|Laptop 16"

So the 4B executor only needs to pick an element ID from a short list. This is what makes small local models viable: vision approaches burn 2-3K tokens per screenshot, easily 50-100K+ for a full flow, while compact snapshots come to ~15K total for the same task.
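Producing that table is just flattening each interactive element into one pipe-delimited row. A minimal sketch with the fields taken from the example above; how elements and their `importance` scores are actually extracted from the page isn't shown in the post, so the input shape here is an assumption:

```python
def serialize_elements(elements):
    """Render a list of element dicts as the compact pipe-delimited table.

    Each dict is assumed to carry the fields from the post's example:
    id, role, text, importance, bg, clickable, nearby_text.
    """
    header = "id|role|text|importance|bg|clickable|nearby_text"
    rows = [
        f"{e['id']}|{e['role']}|{e['text']}|{e['importance']}|"
        f"{e['bg']}|{int(e['clickable'])}|{e.get('nearby_text', '')}"
        for e in elements
    ]
    return "\n".join([header] + rows)
```

At a few dozen characters per row, even a page with a hundred interactive elements stays well under one screenshot's token cost.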

Tested with a Qwen 8B planner + 4B executor on Ace Hardware (a site the model had no prior exposure to):

  • full cart flow completed
  • zero vision model
  • ~15K total tokens (vs 50-100K+ for vision)

One thing that mattered more than expected: modal handling.

After each click, if the DOM suddenly grows, the agent scans for dismiss patterns (close, ×, no thanks, etc.) before planning again.

That alone fixed a lot of failures that looked like "bad reasoning" but were really hidden overlays.
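The overlay check can be sketched as below. The growth threshold, the dismiss-pattern list beyond the examples given, and the element shape are all assumptions here, not details from the post:

```python
# Labels that typically indicate a modal-dismiss control. "close", "×", and
# "no thanks" are from the post; the rest are assumed additions.
DISMISS_PATTERNS = ("close", "×", "x", "no thanks", "dismiss", "not now")

def find_dismiss_target(before_count, elements, growth_threshold=5):
    """After a click, return the id of a likely dismiss element, or None.

    Only fires when the DOM grew noticeably, which suggests a new overlay
    appeared on top of the page the planner was about to reason over.
    """
    if len(elements) - before_count < growth_threshold:
        return None  # DOM barely changed; probably no new overlay
    for e in elements:
        label = (e.get("text") or "").strip().lower()
        if e.get("clickable") and label in DISMISS_PATTERNS:
            return e["id"]
    return None
```

Running this check between the click and the next planning call means the planner never has to reason about a page it can't actually see.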

Curious if others are seeing stepwise beat upfront planning once sites get unfamiliar.

The flow recording for the Amazon shopping demo is attached.

submitted by /u/Aggressive_Bed7113