Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

arXiv cs.RO / 5/5/2026


Key Points

  • The paper targets open-world robotic manipulation, improving cross-task generalization by extracting transferable manipulation skills from previously seen tasks.
  • It argues that existing in-context learning methods mainly provide continuous action trajectories, which leads to superficial imitation rather than composable skill transfer.
  • The proposed “Decompose and Recompose” framework represents knowledge as atomic skill-action pairs, first decomposing demonstrations into interpretable skill-action alignments and then recomposing them for unseen tasks via compositional reasoning.
  • It builds a task-adaptive dynamic demonstration library using visual-semantic retrieval plus skill sequences from a planning agent, and uses a coverage-aware static library to supply missing skill patterns (a sketch of this construction follows the list).
  • Experiments on the AGNOSTOS benchmark and in real-world setups reportedly demonstrate strong zero-shot cross-task generalization.
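
A minimal Python sketch of how such a library construction could look. The paper's code is not reproduced here; every name (`Demo`, `build_demo_library`, the embedding fields, the `static_lib` mapping) is a hypothetical stand-in, and the cosine-similarity retrieval is an assumption about what the visual-semantic matching step computes.

```python
# Hypothetical sketch of the demonstration-library construction, not the
# paper's actual implementation. All identifiers are illustrative.
from dataclasses import dataclass

import numpy as np


@dataclass
class Demo:
    task_desc: str           # language description of the seen task
    embedding: np.ndarray    # precomputed visual-semantic embedding
    skills: list[str]        # atomic skills aligned to action segments


def build_demo_library(query_emb: np.ndarray,
                       planned_skills: list[str],
                       seen_demos: list[Demo],
                       static_lib: dict[str, Demo],
                       k: int = 5) -> list[Demo]:
    """Retrieve the top-k most similar seen demos (dynamic library), then
    patch any planner-required skills they fail to cover from a static
    library that holds one canonical demo per atomic skill pattern."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Dynamic library: rank seen demos by visual-semantic similarity
    # to the unseen task's query embedding.
    ranked = sorted(seen_demos, key=lambda d: cos(query_emb, d.embedding),
                    reverse=True)
    library = ranked[:k]

    # Coverage check: which skills in the planner's sequence are still
    # missing from the retrieved demonstrations?
    covered = {s for d in library for s in d.skills}
    missing = [s for s in planned_skills if s not in covered]

    # Static library fills only the gaps, keeping the context
    # task-adaptive while remaining skill-comprehensive.
    library += [static_lib[s] for s in missing if s in static_lib]
    return library
```

The design point this sketch tries to capture: the static library is consulted only for skills the retrieved demonstrations fail to cover, so the final context stays tailored to the unseen task yet exposes every required skill pattern.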

Abstract

Cross-task generalization is a core challenge in open-world robotic manipulation, and the key lies in extracting transferable manipulation knowledge from seen tasks. Recent in-context learning approaches leverage seen task demonstrations to generate actions for unseen tasks without parameter updates. However, existing methods provide only low-level continuous action sequences as context, failing to capture composable skill knowledge and causing models to degenerate into superficial trajectory imitation. We propose Decompose and Recompose, a skill reasoning framework using atomic skill-action pairs as intermediate representations. Our approach decomposes seen demonstrations into interpretable skill-action alignments, enabling the model to recompose these skills for unseen tasks through compositional reasoning. Specifically, we construct a task-adaptive dynamic demonstration library via visual-semantic retrieval combined with skill sequences from a planning agent, complemented by a coverage-aware static library to fill missing skill patterns. Together, these yield skill-comprehensive demonstrations that explicitly elicit compositional reasoning for skill composition and execution ordering. Experiments on the AGNOSTOS benchmark and real-world environments validate our method's zero-shot cross-task generalization capability.
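
To make the "atomic skill-action pairs as intermediate representations" idea concrete, here is an illustrative sketch of how decomposed demonstrations might be serialized as in-context examples that prompt recomposition and execution ordering. The format and names (`SkillAlignedDemo`, `format_context`) are assumptions for exposition, not the paper's actual prompt or data structures.

```python
# Illustrative serialization of skill-action alignments as in-context
# examples; format and identifiers are assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class SkillAlignedDemo:
    task_desc: str
    # Each entry pairs an atomic skill with the [start, end) index range of
    # the low-level action segment it was aligned to during decomposition.
    segments: list[tuple[str, tuple[int, int]]]


def format_context(demos: list[SkillAlignedDemo], unseen_task: str) -> str:
    """Serialize decomposed demonstrations into a context that exposes
    composable skills rather than one undifferentiated trajectory."""
    lines = []
    for d in demos:
        lines.append(f"Task: {d.task_desc}")
        for skill, (start, end) in d.segments:
            lines.append(f"  Skill: {skill} -> action steps [{start}, {end})")
    lines.append(f"Task: {unseen_task}")
    lines.append("  Recompose the skills above and order their execution.")
    return "\n".join(lines)


# Example: two seen demos decomposed into skill-action alignments.
demos = [
    SkillAlignedDemo("open the drawer", [("grasp-handle", (0, 12)),
                                         ("pull", (12, 30))]),
    SkillAlignedDemo("put the cube in the box", [("grasp", (0, 10)),
                                                 ("move-above", (10, 22)),
                                                 ("release", (22, 25))]),
]
print(format_context(demos, "put the cube in the drawer"))
```

Compared with feeding raw continuous trajectories, a context structured this way names each reusable skill explicitly, which is what lets a model compose "grasp", "pull", and "release" in a new order for the unseen task instead of imitating one trajectory wholesale.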