EvolvingAgent: Curriculum Self-evolving Agent with Continual World Model for Long-Horizon Tasks

arXiv cs.RO / 4/30/2026


Key Points

  • The paper introduces EvolvingAgent, an embodied long-horizon agent designed to autonomously complete tasks in open-ended worlds without human intervention.
  • It targets two limitations of prior work: reliance on human-created data and curricula rather than autonomous selection of multimodal experiences, and failure to continually update world knowledge due to catastrophic forgetting.
  • EvolvingAgent uses a three-module closed-loop system: an LLM-based task planner, a world-model-guided action controller with self-verification to update multimodal experiences, and a curriculum-learning reflector that selects experiences for task-adaptive world model updates (see the sketch after this list).
  • Experiments on Minecraft show a 111.74% improvement in average success rate and a more-than-6× reduction in ineffective actions, and the approach generalizes to Atari with human-level performance.
  • Overall, the work demonstrates that continual multimodal world modeling combined with self-planning, control, and reflection can significantly improve long-horizon embodied performance.
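
To make the closed-loop structure concrete, here is a minimal Python sketch of how the three modules might interact. This is an illustration under stated assumptions, not the paper's implementation: all class and method names (EvolvingAgentLoop, plan, act, self_verify, select, update, task_complete) are hypothetical, since the paper describes the modules but does not publish an interface.

```python
# Hypothetical sketch of the planner-controller-reflector closed loop.
# Every identifier below is illustrative; only the three-module structure
# and the data flow come from the paper's description.
from dataclasses import dataclass

@dataclass
class Experience:
    sub_task: str
    trajectory: list          # multimodal observations and actions
    success: bool
    difficulty: float = 0.0   # used later by the curriculum reflector

class EvolvingAgentLoop:
    def __init__(self, planner, controller, reflector, world_model):
        self.planner = planner          # LLM-based task planner
        self.controller = controller    # world-model-guided action controller
        self.reflector = reflector      # curriculum-learning reflector
        self.world_model = world_model  # continually updated WM
        self.experiences: list[Experience] = []

    def run(self, long_horizon_task: str, max_rounds: int = 10):
        for _ in range(max_rounds):
            # 1) Plan: decompose the LH task into executable sub-tasks,
            #    conditioning on the accumulated multimodal experiences.
            sub_tasks = self.planner.plan(long_horizon_task, self.experiences)
            for sub_task in sub_tasks:
                # 2) Control: the WM guides low-level actions; a
                #    self-verification step judges whether the rollout
                #    succeeded before it is stored as experience.
                trajectory = self.controller.act(sub_task, self.world_model)
                success = self.controller.self_verify(sub_task, trajectory)
                self.experiences.append(Experience(sub_task, trajectory, success))
            # 3) Reflect: curriculum-based selection of experiences, then a
            #    task-adaptive world-model update (continual learning).
            batch = self.reflector.select(self.experiences)
            self.world_model.update(batch)
            if self.planner.task_complete(long_horizon_task, self.experiences):
                break
```

The closed loop is the key design choice the summary highlights: because the controller writes verified experiences back and the reflector selects which of them update the world model, the agent can keep learning without human-curated data.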

Abstract

Completing Long-Horizon (LH) tasks in open-ended worlds is an important yet difficult problem for embodied agents. Existing approaches suffer from two key challenges: (1) they rely heavily on experiences obtained from human-created data or curricula, failing to autonomously update and select multimodal experiences, and (2) they may encounter catastrophic forgetting when faced with new tasks, failing to autonomously update world knowledge. To address these challenges, this paper presents EvolvingAgent, a curriculum self-evolving agent with a continual World Model (WM) that can autonomously complete various LH tasks across environments through self-planning, self-control, and self-reflection, without human intervention. Specifically, EvolvingAgent contains three modules: i) an experience-driven task planner, which uses an LLM together with multimodal experiences to convert LH tasks into executable sub-tasks; ii) a WM-guided action controller, which leverages the WM to generate low-level actions and incorporates a self-verification mechanism to update multimodal experiences; and iii) a Curriculum Learning (CL)-based reflector, which implements a two-stage CL algorithm to select multimodal experiences for task-adaptive WM updates. By forming a planner-controller-reflector closed loop, the continual WM of EvolvingAgent can autonomously update both multimodal experiences and world knowledge. In extensive experiments on Minecraft, EvolvingAgent improves the average success rate by 111.74% over existing methods, reduces ineffective actions by more than 6×, and generalizes to the Atari environment with human-level performance.
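
The abstract states that the reflector runs a two-stage CL algorithm but does not specify the stages. The sketch below is one plausible reading, assuming stage one scores experiences by difficulty (here, via a hypothetical world-model prediction error) and stage two samples them on an easy-to-hard schedule; every name, including prediction_error, is an assumption for illustration.

```python
# Hypothetical two-stage curriculum selection for the reflector.
# The scoring signal and the easy-to-hard schedule are assumptions;
# the paper only says the reflector uses a two-stage CL algorithm.
import random

def two_stage_select(experiences, world_model, progress: float, k: int = 32):
    """Select a batch of experiences; progress in [0, 1] tracks training."""
    # Stage 1: score each experience, e.g. by world-model prediction error
    # on its trajectory (higher error = harder / more informative).
    scored = [(world_model.prediction_error(e.trajectory), e) for e in experiences]
    scored.sort(key=lambda pair: pair[0])  # easiest first
    # Stage 2: easy-to-hard schedule -- widen the admissible difficulty
    # window as training progresses, then sample a batch from that pool.
    cutoff = max(k, int(len(scored) * (0.25 + 0.75 * progress)))
    pool = [e for _, e in scored[:cutoff]]
    return random.sample(pool, min(k, len(pool)))
```

Under this reading, early updates favor experiences the world model already predicts well, which would help limit catastrophic forgetting while harder, more informative experiences are phased in gradually.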