GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
arXiv cs.LG / 4/17/2026
Key Points
- GUI grounding models achieve high benchmark accuracy, but their performance drops sharply (27–56 points) when tasks require spatial reasoning beyond direct element naming.
- The authors argue that existing benchmarks overestimate robustness because they pair each screenshot with a single fixed instruction, which masks failure modes.
- It introduces GUI-Perturbed, a framework that independently varies visual scenes and instructions to measure how robust grounding models are along separate capability axes.
- Experiments on three 7B models show systematic collapse on relational instructions, significant degradation at roughly 70% browser zoom, and worse performance after rank-8 LoRA fine-tuning on augmented data.
- The authors release the dataset, augmentation pipeline, and a fine-tuned model to enable more diagnostic evaluation beyond aggregate benchmarks.
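
The idea of varying scenes and instructions independently can be illustrated with a small evaluation harness. This is a minimal sketch and not the authors' pipeline. Every name in it (`Sample`, `build_grid`, `accuracy_by_axis`, `toy_predict`) is hypothetical, zoom is approximated by uniformly scaling the target box, and the predictor is a stand-in written to fail on relational prompts, so the numbers it prints are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


@dataclass(frozen=True)
class Sample:
    scene: str        # hypothetical scene-variant label, e.g. "zoom_70"
    instruction: str  # hypothetical instruction-variant label, e.g. "relational"
    prompt: str
    target: Box


def hit(point: Point, box: Box) -> bool:
    """A prediction counts as correct if the click lands inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def build_grid(base: Box, scenes: Dict[str, float], prompts: Dict[str, str]) -> List[Sample]:
    """Cross every scene variant with every instruction variant.

    Assumption: a zoom level is modeled as a uniform scale of the target box.
    """
    return [
        Sample(s, i, text, tuple(v * factor for v in base))
        for (s, factor), (i, text) in product(scenes.items(), prompts.items())
    ]


def accuracy_by_axis(samples: List[Sample], predict: Callable[[Sample], Point]):
    """Report accuracy separately per scene variant and per instruction variant."""
    tallies = {"scene": defaultdict(list), "instruction": defaultdict(list)}
    for s in samples:
        ok = hit(predict(s), s.target)
        tallies["scene"][s.scene].append(ok)
        tallies["instruction"][s.instruction].append(ok)
    return {
        axis: {name: sum(v) / len(v) for name, v in groups.items()}
        for axis, groups in tallies.items()
    }


def toy_predict(s: Sample) -> Point:
    """Stand-in model: handles direct naming, always misses relational prompts."""
    if s.instruction == "relational":
        return (0.0, 0.0)
    left, top, right, bottom = s.target
    return ((left + right) / 2, (top + bottom) / 2)


samples = build_grid(
    base=(100.0, 100.0, 200.0, 140.0),
    scenes={"original": 1.0, "zoom_70": 0.7},
    prompts={
        "direct": "Click the Save button",
        "relational": "Click the button to the left of Cancel",
    },
)
report = accuracy_by_axis(samples, toy_predict)
print(report)
```

Because each scene variant is paired with each instruction variant, a drop on one axis can be read off without being confounded by the other, which a single aggregate accuracy over fixed screenshot and instruction pairs cannot show.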



