UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

arXiv cs.CL · April 16, 2026


Key Points

  • UI-Zoomer addresses the challenge of GUI grounding in screenshots, especially for small icons and dense layouts, by improving localization accuracy with adaptive zoom-in rather than uniform cropping.
  • The method reframes whether and how to zoom in as an uncertainty quantification problem, using a confidence-aware gate to trigger zoom-in only when localization is uncertain.
  • UI-Zoomer’s uncertainty-driven crop sizing estimates a per-instance crop radius by decomposing prediction variance into positional spread across stochastic samples and box extent within a sample (via the law of total variance).
  • Experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 show consistent improvements over strong baselines across multiple model architectures, with reported gains of up to +13.4%, +10.3%, and +4.2% on the three benchmarks, respectively.
  • The approach is entirely training-free, making it a practical drop-in enhancement for existing GUI grounding pipelines.
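
To make the variance decomposition concrete, here is a minimal sketch of per-instance crop sizing in the spirit of the paper. It is not the authors' actual formulation: the function name, the uniform-over-box model for within-sample variance, and the constants `k` and `min_r` are all illustrative assumptions.

```python
import numpy as np

def crop_radius(boxes, k=3.0, min_r=32.0):
    """Per-instance crop radius from stochastic box candidates.

    boxes: (n, 4) array of [x1, y1, x2, y2] predictions sampled
    from the model (e.g. via temperature sampling).
    k, min_r: illustrative scale factor and floor, in pixels.
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0   # (n, 2) box centers
    extents = boxes[:, 2:] - boxes[:, :2]           # (n, 2) widths/heights

    # Law of total variance: Var(P) = Var(E[P|s]) + E[Var(P|s)]
    inter = centers.var(axis=0)                     # positional spread across samples
    # Assumption: treat a point uniform over the box as the
    # within-sample distribution, so Var(Uniform(a, b)) = (b - a)^2 / 12.
    intra = (extents ** 2 / 12.0).mean(axis=0)      # mean box-extent variance
    total_std = np.sqrt(inter + intra)              # per-axis std in pixels

    return max(min_r, k * float(total_std.max()))
```

When the stochastic candidates agree tightly, both terms are small and the radius collapses to the floor; when centers scatter or boxes are large, the radius grows to keep the target inside the crop.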

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose **UI-Zoomer**, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
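
The gating step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's exact rule: the fusion weight `alpha`, threshold `tau`, and the choice of mean pairwise IoU as the spatial-consensus signal are all assumptions.

```python
import numpy as np

def should_zoom(boxes, token_logprobs, alpha=0.5, tau=0.6):
    """Confidence-aware gate: trigger zoom-in only when uncertain.

    boxes: (n, 4) stochastic box candidates [x1, y1, x2, y2].
    token_logprobs: per-token log-probabilities of the decoded answer.
    Fuses spatial consensus (mean pairwise IoU across candidates)
    with generation confidence (geometric-mean token probability),
    then compares the fused score against threshold tau.
    """
    boxes = np.asarray(boxes, dtype=float)
    n = len(boxes)

    def iou(a, b):
        ix1, iy1 = np.maximum(a[:2], b[:2])
        ix2, iy2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    pairs = [iou(boxes[i], boxes[j])
             for i in range(n) for j in range(i + 1, n)]
    consensus = float(np.mean(pairs)) if pairs else 1.0
    gen_conf = float(np.exp(np.mean(token_logprobs)))
    fused = alpha * consensus + (1.0 - alpha) * gen_conf
    return fused < tau  # True -> localization is uncertain, zoom in
```

The design choice here is that either signal alone can be misleading: candidates may cluster on a wrong element (high consensus, low token confidence) or the decoder may be confidently verbose while boxes scatter, so the gate requires the fused score to clear the threshold.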