AI Navigate

Adaptive Vision-Language Model Routing for Computer Use Agents

arXiv cs.CL / 3/16/2026


Key Points

  • AVR introduces a lightweight semantic routing layer between the CUA orchestrator and a pool of vision-language models (VLMs) to route each tool call to the most cost-effective model based on estimated action difficulty and a quick confidence probe.
  • The approach formalizes a cost–accuracy trade-off, derives a threshold-based policy for model selection, and benefits from memory-backed context to narrow gaps between small and large models.
  • Evaluations on ScreenSpot-Pro grounding data and the OpenClaw benchmark show up to 78% inference cost reductions while remaining within 2 percentage points of an all-large-model baseline, and a Visual Confused Deputy guardrail escalates high-risk actions to the strongest model for safety.
  • The authors provide code, data, and benchmarks (GitHub link) to enable replication, presenting a unified framework for efficiency and safety in VLM-based computer-use agents.
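The cost–accuracy trade-off and threshold policy mentioned above can be stated as a constrained objective; the notation here is illustrative, not taken from the paper:

```latex
% A routing policy \pi maps each action x to a model m in the pool M.
% c_m: per-call cost of model m; \hat{a}_m(x): predicted accuracy of m on x;
% \tau: target reliability threshold.
\min_{\pi}\ \mathbb{E}_x\!\left[c_{\pi(x)}\right]
\quad\text{subject to}\quad
\mathbb{E}_x\!\left[\hat{a}_{\pi(x)}(x)\right] \ge \tau .

% The resulting threshold policy picks the cheapest qualifying model:
\pi(x) = \operatorname*{arg\,min}_{\,m \in M \,:\, \hat{a}_m(x) \ge \tau}\ c_m .
```

Under this reading, escalation to a large model happens only when no cheaper model's predicted accuracy clears the threshold.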

Abstract

Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose **Adaptive VLM Routing** (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For *warm* agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost–accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.
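The routing behavior described in the abstract can be sketched as follows. Everything here is an illustrative assumption rather than the paper's implementation: the model names, costs, the linear `predicted_accuracy` estimator, and the scalar difficulty score all stand in for AVR's embedding-based difficulty estimate and confidence probe.

```python
from dataclasses import dataclass

@dataclass
class VLM:
    name: str
    cost_per_call: float   # illustrative relative cost, not real pricing
    base_accuracy: float   # hypothetical accuracy on the easiest actions
    robustness: float      # in [0, 1]; higher = degrades less with difficulty

    def predicted_accuracy(self, difficulty: float) -> float:
        # Toy monotone estimator: accuracy drops as difficulty rises,
        # and larger (more robust) models drop more slowly.
        return max(0.0, self.base_accuracy - (1.0 - self.robustness) * difficulty)

def route(difficulty: float, pool: list[VLM], tau: float,
          high_risk: bool = False) -> VLM:
    """Pick the cheapest model whose predicted accuracy meets the
    reliability threshold tau; escalate high-risk actions straight to
    the strongest model, mirroring the guardrail behavior the paper
    describes for the Visual Confused Deputy."""
    if high_risk:
        return max(pool, key=lambda m: m.base_accuracy)
    for m in sorted(pool, key=lambda m: m.cost_per_call):
        if m.predicted_accuracy(difficulty) >= tau:
            return m
    # No cheap model qualifies: fall back to the strongest model.
    return max(pool, key=lambda m: m.base_accuracy)

pool = [
    VLM("small-vlm", cost_per_call=1.0, base_accuracy=0.85, robustness=0.3),
    VLM("large-vlm", cost_per_call=10.0, base_accuracy=0.97, robustness=0.8),
]
print(route(0.1, pool, tau=0.75).name)                  # easy action -> small-vlm
print(route(0.9, pool, tau=0.75).name)                  # hard action -> large-vlm
print(route(0.1, pool, tau=0.75, high_risk=True).name)  # guardrail -> large-vlm
```

Easy actions stay on the cheap model, hard or high-risk ones escalate, which is the mechanism behind the claimed cost savings at near-baseline accuracy.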