UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

arXiv cs.LG / 4/16/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces UI-Copilot, a framework for multi-modal/LLM-based GUI agents that targets long-horizon failures such as memory degradation, progress confusion, and math hallucination.
It uses a collaborative design where the main GUI agent handles execution while a lightweight copilot provides on-demand memory retrieval and numerical computation.
The method proposes memory decoupling to separate persistent observations from transient execution context, improving continuity over extended task sequences.
It trains the policy agent to selectively invoke the copilot as a Retriever or Calculator, using Tool-Integrated Policy Optimization (TIPO) that optimizes tool selection (single-turn) and execution (on-policy multi-turn).
Results report state-of-the-art performance on MemGUI-Bench and a 17.1% absolute improvement on AndroidWorld versus a base Qwen model, indicating strong generalization to real-world GUI tasks.

Abstract

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 4/16DailyView insight →

Black Hat Asia

AI Business

The AI Hype Cycle Is Lying to You About What to Learn

Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

OpenAI Codex April 2026 Update Review: Computer Use, Memory & 90+ Plugins — Is the Hype Real?

Dev.to

Factory hits $1.5B valuation to build AI coding for enterprises

TechCrunch

UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

Key Points

Abstract

💡 Insights using this article

Related Articles

Black Hat Asia

The AI Hype Cycle Is Lying to You About What to Learn

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

OpenAI Codex April 2026 Update Review: Computer Use, Memory & 90+ Plugins — Is the Hype Real?

Factory hits $1.5B valuation to build AI coding for enterprises

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer