Using AI to click around on a website burns 45x as many tokens as just using APIs
For AI agents, seeing is expensive
Businesses deploying AI agents to automate computer use may be spending far more than necessary if those agents try to emulate human visual interaction.
Reflex, an enterprise application platform, recently set out to compare vision agents with API agents.
A vision agent in this context is an AI agent that mimics human interaction, relying on image processing and optical character recognition to operate an application. In this instance, that means Claude Sonnet navigating a web app's user interface via browser-use 0.12, a tool for automated web browser operation.
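For a sense of what that looks like in practice, here is a minimal sketch of the vision side, assuming browser-use's documented Agent pattern; the model alias and the LangChain-style LLM wiring are assumptions, and exact imports vary by release.

```python
# Minimal vision-agent sketch, assuming browser-use's documented Agent
# pattern. Model alias and LLM wiring are assumptions; imports vary by version.
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic

async def main():
    agent = Agent(
        task=(
            "A customer named Smith has complained about a recent order. "
            "Find the Smith with the most orders, accept all their pending "
            "reviews, and mark their most-recent ordered order as delivered."
        ),
        llm=ChatAnthropic(model="claude-sonnet-4-5"),  # assumed model alias
    )
    # Each step: screenshot in, click/type/scroll action out.
    await agent.run()

asyncio.run(main())
```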
An API agent here refers to Claude Sonnet interacting with a web app via tools and APIs. The agent calls the same handling mechanisms that the UI calls and receives structured data in response, rather than a web page screenshot that must be analyzed.
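On the API side, the shape is ordinary tool calling. Here is a minimal sketch using the Anthropic Python SDK; the tool name and schema are hypothetical stand-ins, not Reflex's actual endpoints.

```python
# Minimal API-agent sketch via the Anthropic Python SDK. The tool below
# is a hypothetical stand-in for one of the app's HTTP endpoints.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "list_pending_reviews",  # hypothetical endpoint wrapper
    "description": "List a customer's pending reviews as structured JSON.",
    "input_schema": {
        "type": "object",
        "properties": {"customer_id": {"type": "integer"}},
        "required": ["customer_id"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model alias
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user",
               "content": "Accept all pending reviews for customer 42."}],
)
# The model emits structured tool_use blocks; the harness executes the
# HTTP call and feeds the JSON result back, rather than a screenshot.
```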
"Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly," explained Palash Awasthi, head of growth at Reflex, in a blog post. "Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable."
The following task was presented to each agent: "A customer named Smith has complained about a recent order. Find the Smith with the most orders, accept all their pending reviews, and mark their most-recent ordered order as delivered."
According to Awasthi, the API agent completed the task in just eight calls. It listed pending customer reviews, accepted them, and marked the order delivered.
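The post doesn't publish the app's routes, but the flow plausibly reduces to something like the following, with every endpoint name here a hypothetical illustration:

```python
# Hypothetical endpoints sketching the described call sequence; the real
# Reflex app's routes are not published in the post.
import requests

BASE = "http://localhost:8000"

customers = requests.get(f"{BASE}/customers", params={"name": "Smith"}).json()
smith = max(customers, key=lambda c: c["order_count"])  # Smith with most orders

reviews = requests.get(
    f"{BASE}/reviews",
    params={"customer_id": smith["id"], "status": "pending"},
).json()
for review in reviews:
    requests.post(f"{BASE}/reviews/{review['id']}/accept")

orders = requests.get(
    f"{BASE}/orders",
    params={"customer_id": smith["id"], "sort": "-created_at"},
).json()
requests.patch(f"{BASE}/orders/{orders[0]['id']}", json={"status": "delivered"})
```

With four pending reviews, a sequence like that works out to eight HTTP calls.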
The vision agent, however, found only one of four pending reviews: it failed to scroll the page, leaving the other three reviews hidden off-screen.
Analyzing and interpreting a web page visually is fundamentally harder for an AI model than interacting through API calls and tools.
Even after the prompt was revised to help the vision model perform better, the vision agent took about 17 minutes to the API agent's roughly 20 seconds, and consumed about 45 times as many tokens.
The company made its test available as a benchmark for those interested in trying to reproduce the results.
Awasthi said the cost difference between the two approaches reflects the architecture: vision agents need to see, and seeing is costly, because each screenshot demands thousands of input tokens to process.
Anthropic estimates that processing a 1000×1000-pixel image with Claude Sonnet 4.6 uses about 1,334 tokens.
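That matches Anthropic's published rule of thumb for image inputs, tokens ≈ (width × height) / 750:

```python
import math

def image_tokens(width_px: int, height_px: int) -> int:
    # Anthropic's documented estimate: tokens ≈ (width * height) / 750
    return math.ceil(width_px * height_px / 750)

print(image_tokens(1000, 1000))  # 1334, matching Anthropic's docs
```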
The vision agent expended around 500,000 input tokens and around 38,000 output tokens to complete its task. The API agent used around 12,150 input tokens and around 934 output tokens.
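Plugged into a back-of-envelope cost model (the per-token rates here are assumptions, not figures from the post), the gap is stark:

```python
# Assumed Claude Sonnet rates: $3 per million input tokens,
# $15 per million output tokens. Token figures from the article.
def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * 3 + output_tokens / 1e6 * 15

print(f"vision agent: ${run_cost(500_000, 38_000):.2f}")  # ~$2.07
print(f"API agent:    ${run_cost(12_150, 934):.2f}")      # ~$0.05
```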
For Awasthi, the lesson is that while vision agents may be necessary for interacting with apps you don't control, agents aimed at systems you do control should target APIs. ®



