Using AI to click around on a website burns 45x as many tokens as just using APIs
For AI agents, seeing is expensive
Businesses deploying AI agents to automate computer use may be spending far more than necessary if those agents try to emulate human visual interaction.
Reflex, an enterprise application platform, recently set out to compare vision agents with API agents.
A vision agent in this context is an AI agent that mimics human interaction, relying on image processing and optical character recognition to operate an application. In this instance, that means Claude Sonnet navigating a web app's user interface via browser-use 0.12, a tool for automated web browser operation.
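For a sense of what that looks like in practice, here is a minimal sketch of the vision side, assuming browser-use's documented Agent pattern; the model alias and the LangChain-style LLM wiring are assumptions, and exact imports vary by release.

```python
# Minimal vision-agent sketch, assuming browser-use's documented Agent
# pattern. Model alias and LLM wiring are assumptions; imports vary by version.
import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic

async def main():
    agent = Agent(
        task=(
            "A customer named Smith has complained about a recent order. "
            "Find the Smith with the most orders, accept all their pending "
            "reviews, and mark their most-recent ordered order as delivered."
        ),
        llm=ChatAnthropic(model="claude-sonnet-4-5"),  # assumed model alias
    )
    # Each step: screenshot in, click/type/scroll action out.
    await agent.run()

asyncio.run(main())
```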
An API agent here refers to Claude Sonnet interacting with a web app via tools and APIs. The agent calls the same handling mechanisms that the UI calls and receives structured data in response, rather than a web page screenshot that must be analyzed.
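On the API side, the shape is ordinary tool calling. Here is a minimal sketch using the Anthropic Python SDK; the tool name and schema are hypothetical stand-ins, not Reflex's actual endpoints.

```python
# Minimal API-agent sketch via the Anthropic Python SDK. The tool below
# is a hypothetical stand-in for one of the app's HTTP endpoints.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "list_pending_reviews",  # hypothetical endpoint wrapper
    "description": "List a customer's pending reviews as structured JSON.",
    "input_schema": {
        "type": "object",
        "properties": {"customer_id": {"type": "integer"}},
        "required": ["customer_id"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model alias
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user",
               "content": "Accept all pending reviews for customer 42."}],
)
# The model emits structured tool_use blocks; the harness executes the
# HTTP call and feeds the JSON result back, rather than a screenshot.
```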
"Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly," explained Palash Awasthi, head of growth at Reflex, in a blog post. "Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable."
The following task was presented to each agent: "A customer named Smith has complained about a recent order. Find the Smith with the most orders, accept all their pending reviews, and mark their most-recent ordered order as delivered."
According to Awasthi, the API agent completed the task in just eight calls. It listed pending customer reviews, accepted them, and marked the order delivered.
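The post doesn't publish the app's routes, but the flow plausibly reduces to something like the following, with every endpoint name here a hypothetical illustration:

```python
# Hypothetical endpoints sketching the described call sequence; the real
# Reflex app's routes are not published in the post.
import requests

BASE = "http://localhost:8000"

customers = requests.get(f"{BASE}/customers", params={"name": "Smith"}).json()
smith = max(customers, key=lambda c: c["order_count"])  # Smith with most orders

reviews = requests.get(
    f"{BASE}/reviews",
    params={"customer_id": smith["id"], "status": "pending"},
).json()
for review in reviews:
    requests.post(f"{BASE}/reviews/{review['id']}/accept")

orders = requests.get(
    f"{BASE}/orders",
    params={"customer_id": smith["id"], "sort": "-created_at"},
).json()
requests.patch(f"{BASE}/orders/{orders[0]['id']}", json={"status": "delivered"})
```

With four pending reviews, a sequence like that works out to eight HTTP calls.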
The vision agent, however, found only one of four pending reviews: it failed to scroll the page, leaving the other three reviews hidden off-screen.
Analyzing and interpreting a web page visually is fundamentally harder for an AI model than interacting through API calls and tools.
Even after the prompt was revised to help the vision model perform better, the vision agent took about 17 minutes to the API agent's roughly 20 seconds, and consumed about 45 times as many tokens.
The company made its test available as a benchmark for those interested in trying to reproduce the results.
Awasthi said the cost difference between the two approaches reflects the architecture: vision agents need to see, and seeing is costly, because each screenshot demands thousands of input tokens to process.
Anthropic estimates that processing a 1000×1000-pixel image with Claude Sonnet 4.6 uses about 1,334 tokens.
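That matches Anthropic's published rule of thumb for image inputs, tokens ≈ (width × height) / 750:

```python
import math

def image_tokens(width_px: int, height_px: int) -> int:
    # Anthropic's documented estimate: tokens ≈ (width * height) / 750
    return math.ceil(width_px * height_px / 750)

print(image_tokens(1000, 1000))  # 1334, matching Anthropic's docs
```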
The vision agent expended around 500,000 input tokens and around 38,000 output tokens to complete its task. The API agent used around 12,150 input tokens and around 934 output tokens.
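Plugged into a back-of-envelope cost model (the per-token rates here are assumptions, not figures from the post), the gap is stark:

```python
# Assumed Claude Sonnet rates: $3 per million input tokens,
# $15 per million output tokens. Token figures from the article.
def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * 3 + output_tokens / 1e6 * 15

print(f"vision agent: ${run_cost(500_000, 38_000):.2f}")  # ~$2.07
print(f"API agent:    ${run_cost(12_150, 934):.2f}")      # ~$0.05
```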
For Awasthi, the lesson is that while vision agents may be necessary for interacting with apps you don't control, agents aimed at systems you do control should target APIs. ®



