RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces RemoteAgent, an agentic framework for Earth Observation (EO) that maps vague human natural-language intents to visual analysis tasks at the appropriate granularity.
  • It argues that while MLLMs excel at semantic understanding, their text outputs are ill-suited for dense, precision-critical spatial predictions, so the framework must decide when to act internally and when to invoke external tools.
  • RemoteAgent uses the newly built VagueEO dataset (EO tasks paired with simulated vague queries) and applies reinforcement fine-tuning to improve intent recognition and task execution.
  • The framework orchestrates specialized tools through the Model Context Protocol (MCP) only for dense predictions, aiming to reduce unnecessary tool calls and better utilize the MLLM’s strengths.
  • Experiments reportedly show RemoteAgent achieves strong intent recognition and competitive performance across a range of EO tasks requiring both image-level and sparse/dense predictions.
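The routing behavior described in the bullets above can be sketched as a simple dispatch on task granularity. All class and function names below are hypothetical illustrations, not from the paper:

```python
# Hypothetical sketch of RemoteAgent's routing policy: tasks the MLLM can
# resolve in text (image-level and sparse region-level) are handled
# internally, while dense pixel-wise tasks are delegated to external
# tools via MCP. Names and granularity categories are illustrative only.

from enum import Enum

class Granularity(Enum):
    IMAGE_LEVEL = "image"    # e.g. scene classification, captioning
    SPARSE_REGION = "sparse" # e.g. visual grounding, object counting
    DENSE_PIXEL = "dense"    # e.g. segmentation, change detection

def route(granularity: Granularity) -> str:
    """Decide whether the MLLM answers directly or an MCP tool is called."""
    if granularity is Granularity.DENSE_PIXEL:
        return "mcp_tool"    # text output is ill-suited for pixel-wise maps
    return "mllm_internal"   # stay within the MLLM's capability boundary

print(route(Granularity.IMAGE_LEVEL))  # mllm_internal
print(route(Granularity.DENSE_PIXEL))  # mcp_tool
```

This only illustrates the claimed design principle (invoke tools exclusively for dense predictions); the paper's actual decision policy is learned via reinforcement fine-tuning on VagueEO, not hard-coded.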

Abstract

Earth Observation (EO) systems are designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To address this, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition while delivering highly competitive performance across diverse EO tasks.