PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent

arXiv cs.AI / 4/1/2026


Key Points

  • The paper introduces PSPA-Bench, a new benchmark designed to evaluate how well smartphone GUI agents personalize their assistance to individual user workflows and preferences rather than providing generic solutions.
  • PSPA-Bench includes 12,855+ personalized instructions covering 10 daily-use scenarios and 22 mobile apps, and it uses a structure-aware process evaluation method for fine-grained measurement.
  • Experiments benchmark 11 state-of-the-art GUI agents and find that existing approaches perform poorly in personalized settings, with even the best agent showing limited success.
  • The analysis suggests three improvement directions: reasoning-focused models tend to outperform general LLMs, perception is a critical (though still relatively simple) capability, and reflection plus long-term memory can enhance adaptation.

Abstract

Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately capture this personalization dimension due to sparse user-specific data and the lack of fine-grained evaluation metrics. To address this gap, we present PSPA-Bench, a benchmark dedicated to evaluating personalization in smartphone GUI agents. PSPA-Bench comprises over 12,855 personalized instructions aligned with real-world user behaviors across 10 representative daily-use scenarios and 22 mobile apps, and introduces a structure-aware process evaluation method that measures agents' personalized capabilities at a fine-grained level. Through PSPA-Bench, we benchmark 11 state-of-the-art GUI agents. Results reveal that current methods perform poorly under personalized settings, with even the strongest agent achieving limited success. Our analysis further highlights three directions for advancing personalized GUI agents: (1) reasoning-oriented models consistently outperform general LLMs, (2) perception remains a simple yet critical capability, and (3) reflection and long-term memory mechanisms are key to improving adaptation. Together, these findings establish PSPA-Bench as a foundation for systematic study and future progress in personalized GUI agents.