Adaptive Vision-Language Model Routing for Computer Use Agents
arXiv cs.CL · March 16, 2026
Key Points
- AVR introduces a lightweight semantic routing layer between the CUA orchestrator and a pool of vision-language models (VLMs), routing each tool call to the most cost-effective model based on estimated action difficulty and a quick confidence probe.
- The approach formalizes a cost–accuracy trade-off, derives a threshold-based policy for model selection, and benefits from memory-backed context to narrow gaps between small and large models.
- Evaluations on ScreenSpot-Pro grounding data and the OpenClaw benchmark show up to 78% inference cost reductions while remaining within 2 percentage points of an all-large-model baseline, and a Visual Confused Deputy guardrail escalates high-risk actions to the strongest model for safety.
- The authors provide code, data, and benchmarks (GitHub link) to enable replication, presenting a unified framework for efficiency and safety in VLM-based computer-use agents.
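The threshold-based policy described in the key points can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the model names, the relative costs, the threshold value `tau`, and the `high_risk` guardrail flag are all assumptions introduced for the example.

```python
# Hypothetical sketch of a threshold-based cost-aware router: use the small
# model when its confidence probe clears a threshold chosen from the
# cost-accuracy trade-off, and always escalate high-risk actions (e.g. a
# Visual Confused Deputy detection) to the strongest model.
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    cost: float  # relative inference cost per call (illustrative values)


SMALL = Model("small-vlm", cost=1.0)
LARGE = Model("large-vlm", cost=10.0)


def route(confidence: float, high_risk: bool, tau: float = 0.8) -> Model:
    """Pick a model for one tool call.

    confidence: small model's probe score (assumed in [0, 1])
    high_risk:  guardrail flag for safety-critical actions
    tau:        confidence threshold derived from the cost-accuracy trade-off
    """
    if high_risk:
        return LARGE  # safety escalation bypasses the cost heuristic
    return SMALL if confidence >= tau else LARGE


# Average cost over a mix of calls (confidence, high_risk pairs)
calls = [(0.95, False), (0.60, False), (0.90, True), (0.85, False)]
total_cost = sum(route(c, r).cost for c, r in calls)
```

With these made-up numbers, two of the four calls stay on the small model, so total cost is 22 versus 40 for an all-large baseline, which mirrors the kind of savings the paper reports at scale.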