Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents

arXiv cs.CL / 4/6/2026

Key Points

The paper argues that existing demonstration-learning methods for mobile-use agents capture only explicit intention flows (such as step sequences) and miss implicit ones (such as personal preferences), which are needed for true personalization.
It introduces MobileIAR, a new dataset with human-intent-aligned actions and ground-truth actions, to more comprehensively evaluate intention alignment between agents and humans.
It proposes IFRAgent, which separates explicit intention flow recognition (to build a SOP library) from implicit intention flow recognition (to build a user-level habit repository) using human demonstrations.
IFRAgent uses a SOP extractor plus retrieval-augmented generation and a query rewriter to transform ambiguous user queries into personalized query/SOP pairs for better intent matching.
Experiments show that IFRAgent improves the human intention alignment rate by an average of 6.79% (a 32.06% relative improvement) and step completion rates by an average of 5.30% (a 26.34% relative improvement). The authors release the code publicly.
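
The flow described in the points above (retrieve an SOP for the query, then rewrite the query with the user's habits) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real text encoder, and `SOP_LIBRARY`, `HABITS`, `retrieve_sop`, `rewrite_query`, and `personalize` are hypothetical names with made-up contents.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical query-level SOP library: (demonstrated query, step sequence).
SOP_LIBRARY = [
    ("order a coffee", ["open coffee app", "choose drink", "pay"]),
    ("book a taxi", ["open ride app", "enter destination", "confirm ride"]),
]

# Hypothetical user-level habit repository: user -> topic -> preference.
HABITS = {"alice": {"coffee": "iced latte, oat milk", "taxi": "economy tier"}}


def retrieve_sop(query):
    """Return the library entry whose query is most similar to the input."""
    q = embed(query)
    return max(SOP_LIBRARY, key=lambda item: cosine(q, embed(item[0])))


def rewrite_query(query, user):
    """Append the user's stored preferences for topics named in the query."""
    habits = HABITS.get(user, {})
    hints = [pref for topic, pref in habits.items() if topic in query.lower()]
    return f"{query} ({'; '.join(hints)})" if hints else query


def personalize(query, user):
    """Turn a raw, ambiguous query into a (personalized query, SOP) pair."""
    _, sop = retrieve_sop(query)
    return rewrite_query(query, user), sop


if __name__ == "__main__":
    print(personalize("get me a coffee", "alice"))
```

In this sketch, retrieval and rewriting are independent: a user with no stored habits still gets a matching SOP, and the query passes through unchanged.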

Abstract

As multimodal large language models advance rapidly, automating mobile tasks has become increasingly feasible through mobile-use agents that mimic human interactions with graphical user interfaces. To further enhance these agents, previous studies employ demonstration learning to improve them from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the Intention Alignment Rate (IAR) between mobile-use agents and humans, we first collect MobileIAR, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents' understanding of human intent. We then propose IFRAgent, a framework built upon Intention Flow Recognition (IFR) from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOPs), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages a SOP extractor combined with retrieval-augmented generation and a query rewriter to generate a personalized query and SOP from a raw ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement). The code is available at https://github.com/MadeAgents/Quick-on-the-Uptake.
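
The abstract reports each gain both as an absolute figure and as a relative improvement. The sketch below shows one plausible way these quantities relate. It assumes, without confirmation from the source, that the intention alignment rate is the fraction of steps where the agent's action matches the human-intent-aligned action, and that the step completion rate is the same fraction computed against the ground-truth action. The function names and example actions are hypothetical and are not taken from MobileIAR.

```python
def match_rate(predicted, reference):
    """Fraction of steps where the predicted action equals the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("action lists must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def improvements(baseline, improved):
    """Return (absolute gain, relative gain) of one rate over another."""
    absolute = improved - baseline
    return absolute, absolute / baseline


if __name__ == "__main__":
    # Made-up three-step episode, for illustration only.
    agent = ["tap:search", "type:latte", "tap:pay"]
    intent_aligned = ["tap:search", "type:iced latte", "tap:pay"]
    ground_truth = ["tap:search", "type:latte", "tap:pay"]

    print("alignment rate:", match_rate(agent, intent_aligned))
    print("step completion:", match_rate(agent, ground_truth))

    # A rate rising from 0.25 to 0.50 gains 0.25 absolute, 100% relative.
    print(improvements(0.25, 0.5))
```

The example episode shows why the two metrics can diverge: an action can be a valid way to complete the task while still missing the user's preference.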