Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

arXiv cs.AI / 5/4/2026

Key Points

  • The paper introduces GUI-SD, which the authors describe as the first on-policy self-distillation (OPSD) framework tailored for GUI grounding, the task of mapping natural language instructions to the visual coordinates of target elements.
  • GUI-SD improves teacher guidance by building a visually enriched privileged context using a target bounding box and a Gaussian soft mask, avoiding leakage of exact coordinates while still providing dense supervision.
  • It uses entropy-guided distillation that weights tokens by digit significance and teacher confidence, concentrating optimization on the most impactful and reliable token positions.
  • Experiments across six GUI grounding benchmarks show GUI-SD outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency.
  • The authors provide code and training data publicly, enabling replication and further development of OPSD for GUI-grounding agents.
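
To make the Gaussian soft mask in the key points concrete, here is a minimal sketch of how such a mask could be derived from a target bounding box. The summary does not give the paper's formula, so everything here is an assumption for illustration: the function name `gaussian_soft_mask`, the `sigma_scale` parameter, and the choice to center the Gaussian on the box center with a spread proportional to the box size.

```python
import math


def gaussian_soft_mask(width, height, bbox, sigma_scale=0.5):
    """Return a height x width grid of weights in (0, 1], peaking at the bbox center.

    bbox is (x1, y1, x2, y2) in pixels. sigma_scale is a hypothetical knob
    (not from the paper) that ties the Gaussian spread to the box size.
    """
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Guard against degenerate boxes so the spread never collapses to zero.
    sx = max((x2 - x1) * sigma_scale, 1.0)
    sy = max((y2 - y1) * sigma_scale, 1.0)

    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            z = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
            row.append(math.exp(-0.5 * z))
        mask.append(row)
    return mask


# Toy 64x48 "screenshot" with a target element at (20, 10)-(40, 30).
mask = gaussian_soft_mask(width=64, height=48, bbox=(20, 10, 40, 30))
```

A mask like this marks the target region softly rather than as text, which is one plausible reading of how the teacher can be guided without seeing exact coordinates. How GUI-SD actually combines the mask with the screenshot is not stated in this summary.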

Abstract

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan-ucas.github.io/GUI-SD/.
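
The entropy-guided weighting described in the abstract can be sketched roughly as follows. This is an illustrative guess, not the authors' implementation: the geometric `decay` for digit significance, the use of one minus normalized entropy as teacher confidence, the forward KL direction (teacher to student), and all function names are assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def token_weights(teacher_dists, decay=0.5):
    """One weight per coordinate digit token, normalized to sum to 1.

    Assumed scheme: earlier (more significant) digits get a larger weight
    via decay**i, and tokens where the teacher is uncertain are down-weighted.
    """
    raw = []
    for i, dist in enumerate(teacher_dists):
        significance = decay ** i
        max_entropy = math.log(len(dist))
        # Clamp at zero to absorb floating-point error on uniform distributions.
        confidence = max(0.0, 1.0 - entropy(dist) / max_entropy)
        raw.append(significance * confidence)
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def weighted_distill_loss(teacher_dists, student_dists, decay=0.5):
    """Weighted sum of per-token KL(teacher || student)."""
    weights = token_weights(teacher_dists, decay)
    return sum(
        w * kl_divergence(t, s)
        for w, t, s in zip(weights, teacher_dists, student_dists)
    )


def peaked(idx, p, vocab=10):
    """Toy distribution over digits 0-9 with mass p on one digit."""
    rest = (1.0 - p) / (vocab - 1)
    return [p if i == idx else rest for i in range(vocab)]


# Three digit tokens of one coordinate, most significant first.
teacher = [peaked(4, 0.91), peaked(2, 0.55), [0.1] * 10]
student = [peaked(4, 0.40), peaked(7, 0.30), [0.1] * 10]

weights = token_weights(teacher)
loss = weighted_distill_loss(teacher, student)
```

In this toy setup the leading digit, where the teacher is confident, dominates the loss, while the last digit, where the teacher is uniform, contributes almost nothing. That matches the stated goal of concentrating optimization on impactful and reliable positions, under the assumptions listed above.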