VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

arXiv cs.AI / 4/10/2026

Key Points

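The paper argues that existing online benchmarks for mobile GUI agents are app-centric and task-homogeneous, and therefore fail to reflect the diversity and instability of real-world mobile usage.
As a remedy, it proposes VenusBench-Mobile, built on two pillars: task design driven by user intent, and a capability-oriented annotation scheme that enables fine-grained analysis of agent behavior.
Evaluating state-of-the-art mobile GUI agents revealed large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses harder and more realistic tasks.
Failures stem mainly from deficiencies in perception and memory, a pattern that coarse-grained evaluation obscures and the diagnostic analysis makes explicit. Under environment variations, even the strongest agents achieve near-zero success, underscoring their brittleness.
Code and data are publicly available, and the authors position the benchmark as an important stepping stone toward robust real-world deployment.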

Abstract

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile.
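
To illustrate why capability-oriented annotation can surface failures that a single aggregate score hides, here is a minimal Python sketch. It is not taken from the VenusBench-Mobile codebase: the `TaskResult` schema, the capability tag names, and the toy episodes are all hypothetical, and the numbers are made up for illustration rather than drawn from the paper's results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One benchmark episode (hypothetical schema, not the paper's format)."""
    task_id: str
    capabilities: tuple[str, ...]  # capability tags annotated on the task
    success: bool


def overall_success(results):
    """Coarse-grained metric: fraction of all episodes that succeeded."""
    return sum(r.success for r in results) / len(results) if results else 0.0


def success_by_capability(results):
    """Fine-grained metric: success rate over the tasks carrying each tag."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in results:
        for cap in r.capabilities:
            totals[cap] += 1
            wins[cap] += int(r.success)
    return {cap: wins[cap] / totals[cap] for cap in totals}


# Toy episodes, invented purely for illustration.
sample = [
    TaskResult("t1", ("perception",), False),
    TaskResult("t2", ("perception", "memory"), False),
    TaskResult("t3", ("planning",), True),
    TaskResult("t4", ("memory", "planning"), True),
]

if __name__ == "__main__":
    print("overall:", overall_success(sample))
    for cap, rate in sorted(success_by_capability(sample).items()):
        print(f"{cap}: {rate:.2f}")
```

In this toy data the overall success rate is 0.5, which says little on its own, while the per-capability breakdown shows that every perception-tagged task failed. A task with several tags counts toward each of them, so the per-capability rates are not expected to average to the overall rate.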
