MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

arXiv cs.AI / 4/14/2026


Key Points

  • The paper argues that existing mobile-agent benchmarks (e.g., AndroidWorld) rely on emulator/system-level signals that don’t reflect real-world cases where many third-party apps don’t expose success metrics.
  • It introduces MobiFlow, a mobile agent evaluation framework that builds tasks from arbitrary third-party applications to better match real usage conditions.
  • MobiFlow uses an efficient graph-construction method based on multi-trajectory fusion to compress the state space and support dynamic interaction during evaluation.
  • The framework includes 20 widely used third-party apps and 240 real-world tasks, along with enriched evaluation metrics.
  • Compared with AndroidWorld, MobiFlow reports evaluation outcomes that align more closely with human judgments and can inform training of future GUI-based models under real workloads.
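
The multi-trajectory fusion idea in the third bullet can be sketched as follows. This is a hypothetical illustration, since the summary does not specify MobiFlow's actual state representation or fusion algorithm; it assumes UI states can be reduced to hashable signatures, so that identical states from different reference runs collapse into shared graph nodes:

```python
from collections import defaultdict

def fuse_trajectories(trajectories):
    """Fuse multiple trajectories (lists of hashable UI-state signatures)
    into one directed graph, merging identical states so the combined
    state space is smaller than the sum of the trajectory lengths.

    Hypothetical sketch only: MobiFlow's real fusion method is not
    described in this summary.
    """
    edges = defaultdict(set)  # state -> set of successor states
    nodes = set()
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            nodes.add(src)
            nodes.add(dst)
            edges[src].add(dst)
    return nodes, dict(edges)

# Two reference runs of a "find a product" task that share screens:
t1 = ["home", "search", "results", "item_A", "cart"]
t2 = ["home", "search", "results", "item_B", "cart"]
nodes, edges = fuse_trajectories([t1, t2])
# The 10 raw states collapse to 6 fused nodes; the shared prefix
# ("home" -> "search" -> "results") and suffix ("cart") merge.
```

The compression comes purely from node sharing: branch points (here, "results") keep multiple outgoing edges, which is what lets the graph represent alternative valid paths during dynamic interaction.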

Abstract

Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and derive evaluation signals from the state of system resources. In real-world mobile-agent scenarios, many third-party applications do not expose system-level APIs for determining whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address this mismatch, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow's evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.
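
Because third-party apps expose no success API, evaluation must come from the agent's observed behavior itself. One plausible reading of graph-based evaluation, sketched here as an assumption (the summary does not state MobiFlow's actual matching criteria), is to walk an observed trajectory along the fused graph and require every transition to be a known edge, ending in a designated goal state:

```python
def trajectory_succeeds(edges, trajectory, goal_states):
    """Judge task success by replaying the observed trajectory on a
    fused state graph: each step must follow a known edge, and the run
    must terminate in a goal state. Hypothetical sketch; not MobiFlow's
    documented procedure.
    """
    for src, dst in zip(trajectory, trajectory[1:]):
        if dst not in edges.get(src, set()):
            return False  # transition never seen in any reference run
    return trajectory[-1] in goal_states

# A fused graph built from reference runs (hypothetical example):
edges = {
    "home": {"search"},
    "search": {"results"},
    "results": {"item_A", "item_B"},
    "item_A": {"cart"},
    "item_B": {"cart"},
}
ok = trajectory_succeeds(edges, ["home", "search", "results", "item_B", "cart"], {"cart"})
bad = trajectory_succeeds(edges, ["home", "results", "cart"], {"cart"})
```

Under this scheme, any of the alternative paths merged into the graph counts as success, which is how an emulator-free benchmark could stay aligned with human judgments of "did the task get done" rather than with a single scripted action sequence.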