UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

arXiv cs.CL · April 16, 2026


Key Points

  • UI-Zoomer addresses the challenge of GUI grounding in screenshots, especially for small icons and dense layouts, by improving localization accuracy with adaptive zoom-in rather than uniform cropping.
  • The method reframes whether and how to zoom in as an uncertainty quantification problem, using a confidence-aware gate to trigger zoom-in only when localization is uncertain.
  • UI-Zoomer’s uncertainty-driven crop sizing estimates a per-instance crop radius by decomposing prediction variance into positional spread across stochastic samples and box extent within a sample (via the law of total variance).
  • Experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 show consistent improvements over strong baselines across multiple model architectures, with reported gains of up to +13.4%, +10.3%, and +4.2% on the three benchmarks, respectively.
  • The approach is entirely training-free, making it a practical drop-in enhancement for existing GUI grounding pipelines.
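
To make the variance decomposition concrete, here is a minimal sketch of per-instance crop sizing in the spirit of the paper. It is not the authors' actual formulation: the function name, the uniform-over-box model for within-sample variance, and the constants `k` and `min_r` are all illustrative assumptions.

```python
import numpy as np

def crop_radius(boxes, k=3.0, min_r=32.0):
    """Per-instance crop radius from stochastic box candidates.

    boxes: (n, 4) array of [x1, y1, x2, y2] predictions sampled
    from the model (e.g. via temperature sampling).
    k, min_r: illustrative scale factor and floor, in pixels.
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0   # (n, 2) box centers
    extents = boxes[:, 2:] - boxes[:, :2]           # (n, 2) widths/heights

    # Law of total variance: Var(P) = Var(E[P|s]) + E[Var(P|s)]
    inter = centers.var(axis=0)                     # positional spread across samples
    # Assumption: treat a point uniform over the box as the
    # within-sample distribution, so Var(Uniform(a, b)) = (b - a)^2 / 12.
    intra = (extents ** 2 / 12.0).mean(axis=0)      # mean box-extent variance
    total_std = np.sqrt(inter + intra)              # per-axis std in pixels

    return max(min_r, k * float(total_std.max()))
```

When the stochastic candidates agree tightly, both terms are small and the radius collapses to the floor; when centers scatter or boxes are large, the radius grows to keep the target inside the crop.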

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose **UI-Zoomer**, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
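
The gating step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's exact rule: the fusion weight `alpha`, threshold `tau`, and the choice of mean pairwise IoU as the spatial-consensus signal are all assumptions.

```python
import numpy as np

def should_zoom(boxes, token_logprobs, alpha=0.5, tau=0.6):
    """Confidence-aware gate: trigger zoom-in only when uncertain.

    boxes: (n, 4) stochastic box candidates [x1, y1, x2, y2].
    token_logprobs: per-token log-probabilities of the decoded answer.
    Fuses spatial consensus (mean pairwise IoU across candidates)
    with generation confidence (geometric-mean token probability),
    then compares the fused score against threshold tau.
    """
    boxes = np.asarray(boxes, dtype=float)
    n = len(boxes)

    def iou(a, b):
        ix1, iy1 = np.maximum(a[:2], b[:2])
        ix2, iy2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    pairs = [iou(boxes[i], boxes[j])
             for i in range(n) for j in range(i + 1, n)]
    consensus = float(np.mean(pairs)) if pairs else 1.0
    gen_conf = float(np.exp(np.mean(token_logprobs)))
    fused = alpha * consensus + (1.0 - alpha) * gen_conf
    return fused < tau  # True -> localization is uncertain, zoom in
```

The design choice here is that either signal alone can be misleading: candidates may cluster on a wrong element (high consensus, low token confidence) or the decoder may be confidently verbose while boxes scatter, so the gate requires the fused score to clear the threshold.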