Using AI to click around on a website burns 45x as many tokens as just using APIs

The Register / 5/7/2026


Key Points

  • The article reports a benchmark finding that AI “vision” agents which navigate by clicking around a website consume about 45x more tokens than agents that use APIs directly.
  • The headline takeaway is that “seeing” (visual interaction) is significantly more expensive computationally and token-wise than structured API access.
  • The result implies that, when available, API-based integration is far more cost-efficient for automating web tasks with AI.
  • The piece frames the tradeoff as a practical design consideration for AI agent builders deciding between visual browsing and API calls.


For AI agents, seeing is expensive

Thomas Claburn

Businesses deploying AI agents to automate computer usage may be spending far more money than necessary if those agents try to emulate human visual interaction.

Reflex, an enterprise application platform, recently set out to compare vision agents with API agents.

A vision agent in this context refers to an AI agent that mimics human interaction by relying on image processing and optical character recognition to operate an application. In this instance, that's Claude Sonnet navigating a web app user interface via browser-use 0.12, a tool for automated web browser operation.


An API agent here refers to Claude Sonnet interacting with a web app via tools and APIs. The agent calls the same handling mechanisms that the UI calls and receives structured data in response, rather than a web page screenshot that must be analyzed.
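The structural advantage is easy to see in miniature. The sketch below is hypothetical, not Reflex's actual code: the function names (`list_pending_reviews`, `accept_review`) and the sample data are invented for illustration. The point is that an API agent calls handlers directly and receives the complete result set as structured data, so nothing can hide "off-screen":

```python
# Hypothetical sketch of the API-agent pattern described above: the agent
# calls the same handlers the UI uses and gets structured data back.
# All names and data here are illustrative, not Reflex's actual endpoints.

REVIEWS = [
    {"id": 1, "customer": "Smith", "status": "pending"},
    {"id": 2, "customer": "Smith", "status": "pending"},
    {"id": 3, "customer": "Jones", "status": "accepted"},
]

def list_pending_reviews(customer: str) -> list[dict]:
    """Return every pending review for a customer as structured data.
    Unlike a screenshot, the full result set is always visible."""
    return [r for r in REVIEWS
            if r["customer"] == customer and r["status"] == "pending"]

def accept_review(review_id: int) -> dict:
    """Mark one review accepted and return the updated record."""
    for r in REVIEWS:
        if r["id"] == review_id:
            r["status"] = "accepted"
            return r
    raise KeyError(review_id)

# The agent's tool loop reduces to a handful of calls, no screenshots:
for review in list_pending_reviews("Smith"):
    accept_review(review["id"])
```

A vision agent attempting the same task must instead infer the list's contents from pixels, which is where the off-screen reviews described below get lost.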


"Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly," explained Palash Awasthi, head of growth at Reflex, in a blog post. "Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable."

The following task was presented to each agent: "A customer named Smith has complained about a recent order. Find the Smith with the most orders, accept all their pending reviews, and mark their most-recent ordered order as delivered."

According to Awasthi, the API agent completed the task in just eight calls. It listed pending customer reviews, accepted them, and marked the order delivered. 

The vision agent, however, found only one of four pending reviews because it failed to scroll the page where it would have seen the three other reviews hidden off-screen.

Analyzing and interpreting a web page visually is fundamentally harder for an AI model than working with structured responses from API calls and tools.

When the prompt was revised to help the vision model perform better, the vision agent still took about 17 minutes to finish the task, versus about 20 seconds for the API agent, and it consumed roughly 45x more tokens.

The company made its test available as a benchmark for those interested in trying to reproduce the results.

Awasthi said that the cost difference between the two approaches reflects the architecture: vision agents need to see, and seeing is costly, because each screenshot demands thousands of input tokens to process.


Anthropic estimates that processing a 1000×1000-pixel image with Claude Sonnet 4.6 uses about 1,334 tokens.
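That figure follows from Anthropic's published rule of thumb for image token cost, roughly (width × height) / 750 tokens per image. A quick check:

```python
import math

def image_tokens(width_px: int, height_px: int) -> int:
    """Approximate Claude's image token cost using Anthropic's published
    rule of thumb: tokens ≈ (width * height) / 750."""
    return math.ceil(width_px * height_px / 750)

print(image_tokens(1000, 1000))  # 1334, matching the estimate above
```

At that rate, an agent that takes a fresh screenshot on every step accumulates thousands of input tokens per interaction before it has produced a single action.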

The vision agent expended around 500,000 input tokens and around 38,000 output tokens to complete its task. The API agent used around 12,150 input tokens and around 934 output tokens.
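Back-of-the-envelope arithmetic on those rounded figures puts the total-token gap at roughly 41x; the headline ~45x presumably reflects the unrounded benchmark numbers. The dollar figures below assume Claude Sonnet's published list pricing of $3 per million input tokens and $15 per million output tokens, which may not match Reflex's actual bill:

```python
# Rounded token counts as reported in the benchmark.
vision = {"input": 500_000, "output": 38_000}
api = {"input": 12_150, "output": 934}

# Total-token ratio from the rounded figures.
token_ratio = sum(vision.values()) / sum(api.values())

# Dollar cost under assumed Sonnet list pricing: $3/M input, $15/M output.
PRICE = {"input": 3 / 1_000_000, "output": 15 / 1_000_000}

def cost(usage: dict) -> float:
    return sum(usage[k] * PRICE[k] for k in usage)

print(round(token_ratio, 1))   # ratio on the rounded figures
print(round(cost(vision), 2))  # vision agent, dollars
print(round(cost(api), 3))     # API agent, dollars
```

Either way the order of magnitude is the same: dollars per task for the vision agent against cents for the API agent.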

For Awasthi, the lesson is that while vision agents may be necessary for interacting with apps you don't control, agents operating within your own systems should target APIs. ®