Small local LLMs got much better at browser automation once I stopped asking them to plan the whole task upfront.

What failed repeatedly was this: the model sees the goal, then invents a full multi-step plan before seeing the real page state. That works on familiar sites but breaks fast on anything unexpected.

What worked better was stepwise planning: each step replans from the current DOM snapshot instead of assuming what should exist next.

The other thing that made this work was a compact DOM representation. The model never sees raw HTML or screenshots, just a semantic table of page elements, so the 4B executor only needs to pick an element ID from a short list.

This is what enables small local models. Vision approaches burn 2-3K tokens per screenshot, easily 50-100K+ for a full flow. Compact snapshots come to ~15K tokens total for the same task.

Tested with a Qwen 8B planner + 4B executor on Ace Hardware (a site the model had no prior task for).
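To make the compact-snapshot idea concrete, here is a minimal sketch of how such a table might be rendered and how the executor's reply could be mapped back to an element. It is not the author's code: the `Element` fields, the pipe-delimited layout, and the function names `render_snapshot` and `parse_choice` are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """One interactive page element (field names are assumed, not from the post)."""
    id: int
    role: str
    text: str


def render_snapshot(elements: list[Element], max_text: int = 40) -> str:
    """Render elements as a small pipe-delimited table, one row per element."""
    rows = ["id|role|text"]
    for el in elements:
        # Collapse runs of whitespace and truncate long labels to save tokens.
        text = " ".join(el.text.split())[:max_text]
        rows.append(f"{el.id}|{el.role}|{text}")
    return "\n".join(rows)


def parse_choice(reply: str, elements: list[Element]) -> Optional[Element]:
    """Map the executor's reply (expected to be a bare element ID) to an element."""
    by_id = {el.id: el for el in elements}
    try:
        return by_id.get(int(reply.strip()))
    except ValueError:
        # The reply was not a number; the caller can retry or replan.
        return None
```

Because the executor only has to emit one of the IDs shown in the table, an invalid reply is easy to detect and reject before anything is clicked.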
One thing that mattered more than expected: modal handling. After each click, if the DOM suddenly grows, the agent scans for dismiss patterns and closes the overlay. That alone fixed a lot of failures that looked like "bad reasoning" but were really hidden overlays.

Curious if others are seeing stepwise beat upfront planning once sites get unfamiliar.

The flow recording is attached for the Amazon shopping demo.
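The modal check described in the post could look roughly like the sketch below. The post's own list of dismiss patterns did not survive, so the patterns here are hypothetical, as are the growth threshold, the dict-based element shape, and the assumption that element IDs stay stable between snapshots.

```python
import re
from typing import Optional

# Hypothetical dismiss labels; the original post's list is not available.
DISMISS_PATTERNS = re.compile(r"^(close|dismiss|no thanks|not now|×|x)$", re.IGNORECASE)
GROWTH_THRESHOLD = 1.3  # assumed: treat a 30%+ jump in element count as "sudden growth"


def dom_grew(before: int, after: int, threshold: float = GROWTH_THRESHOLD) -> bool:
    """Return True if the element count jumped by at least the threshold ratio."""
    return before > 0 and after / before >= threshold


def find_dismiss_target(elements: list[dict]) -> Optional[int]:
    """Return the ID of the first button or link whose text looks like a dismiss control."""
    for el in elements:
        if el["role"] in ("button", "link") and DISMISS_PATTERNS.match(el["text"].strip()):
            return el["id"]
    return None


def modal_to_dismiss(before: list[dict], after: list[dict]) -> Optional[int]:
    """After a click, look for a dismiss control among newly added elements."""
    if not dom_grew(len(before), len(after)):
        return None
    before_ids = {el["id"] for el in before}  # assumes IDs are stable across snapshots
    new_elements = [el for el in after if el["id"] not in before_ids]
    return find_dismiss_target(new_elements)
```

Restricting the scan to newly added elements keeps an ordinary "Close" link elsewhere on the page from being clicked by mistake.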
Local Qwen 8B + 4B completes browser automation by replanning one step at a time
Reddit r/LocalLLaMA / 3/17/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- Local Qwen 8B planner + 4B executor achieve browser automation by replanning after each DOM snapshot instead of relying on an upfront, full-task plan, improving reliability on unfamiliar pages.
- The approach uses a compact, semantic DOM representation (id, role, text, etc.) so the model never sees raw HTML or screenshots, reducing token requirements significantly.
- In an Ace Hardware demo, the full cart flow was completed using a 4B executor with zero vision, totaling about 15K tokens versus 50-100K+ for vision-based approaches.
- Modal handling improvements—scanning for and dismissing overlays after each click—greatly reduced failures caused by hidden UI elements.
- The results suggest stepwise planning could generalize to other unfamiliar sites, with a flow recording attached for an Amazon shopping demo as additional evidence.
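
As an illustration of the stepwise replanning summarized in the key points, the loop below asks for only the next step from a fresh snapshot on every iteration. It is a sketch under assumptions, not the post's implementation: `run_stepwise` and the `observe`, `plan_next`, and `execute` callables are hypothetical stand-ins for the browser snapshot, the 8B planner, and the 4B executor plus browser action.

```python
from typing import Callable, Optional


def run_stepwise(
    goal: str,
    observe: Callable[[], str],
    plan_next: Callable[[str, str, list], Optional[str]],
    execute: Callable[[str, str], None],
    max_steps: int = 20,
) -> list:
    """Replan one step at a time from the current page state.

    observe()   -> compact snapshot of the page as it is right now
    plan_next() -> the next step only, or None when the goal is reached
    execute()   -> carries out that single step against the same snapshot
    """
    history: list = []
    for _ in range(max_steps):
        snapshot = observe()  # fresh state each step, never an assumed one
        step = plan_next(goal, snapshot, history)
        if step is None:
            break  # the planner reports the task as complete
        execute(step, snapshot)
        history.append(step)
    return history
```

An upfront planner would call the model once and then execute a fixed list; here nothing beyond the next step is committed, so an unexpected page only affects the step that is planned after it appears.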