Iterative Compositional Data Generation for Robot Control

arXiv cs.RO / 4/15/2026


Key Points

  • The paper addresses the high cost of collecting robotic manipulation demonstrations and argues that existing generative methods fail to leverage the compositional structure of multi-object, multi-robot, and multi-environment task spaces.
  • It proposes a semantic compositional diffusion transformer that decomposes robot dynamics into robot-, object-, obstacle-, and objective-specific components and uses attention to learn how these factors interact.
  • The model is trained on a limited subset of tasks and then performs zero-shot generation of transition data for unseen task combinations, enabling learning of control policies in those new settings.
  • An iterative self-improvement loop validates synthetic transitions using offline reinforcement learning and feeds the validated data back into further training rounds.
  • Results indicate substantially improved zero-shot performance over monolithic and hard-coded compositional baselines, with the method solving nearly all held-out tasks; the learned representations also appear to develop meaningful compositional structure.
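The factorization described above can be illustrated with a minimal sketch: each task factor (robot, object, obstacle, objective) becomes its own token embedding, and a self-attention layer mixes them so the model can learn how the factors interact. All names, dimensions, and weights below are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over the component tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise factor interactions
    return softmax(scores) @ v                # mix factor information

rng = np.random.default_rng(0)
d = 8
# One embedding per semantic component of a task (toy random vectors).
components = {name: rng.normal(size=d)
              for name in ("robot", "object", "obstacle", "objective")}
tokens = np.stack(list(components.values()))   # shape (4, d)
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

mixed = self_attention(tokens, wq, wk, wv)     # shape (4, d)
print(mixed.shape)  # (4, 8)
```

Because each factor is a separate token, swapping in an embedding for an unseen robot or object changes only one row of `tokens`, which is the structural property that makes zero-shot recombination plausible.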

Abstract

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arises in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. We show that, once trained on a limited subset of tasks, the model can zero-shot generate high-quality transitions from which control policies for unseen task combinations can be learned. We then introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
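The iterative self-improvement procedure can be sketched as a simple loop: generate synthetic transitions, validate them, fold the validated data back into training, and repeat. The generator, offline-RL validator, and quality update below are toy stand-ins for the paper's components, kept only to show the control flow.

```python
import random

def generate_transitions(model_quality, n=100, seed=0):
    """Toy generator: each 'transition' is just a scalar quality score."""
    rng = random.Random(seed)
    return [rng.random() * model_quality for _ in range(n)]

def validate_offline_rl(transitions, threshold=0.5):
    """Toy validator: keep transitions whose score clears a bar,
    standing in for validation via an offline-RL policy's performance."""
    return [t for t in transitions if t > threshold]

def self_improvement_loop(rounds=3):
    quality, dataset = 0.6, []
    for r in range(rounds):
        synthetic = generate_transitions(quality, seed=r)
        validated = validate_offline_rl(synthetic)
        dataset.extend(validated)  # fold validated data into the training set
        # Stand-in for retraining: quality grows with the validated fraction.
        quality = min(1.0, quality + 0.1 * len(validated) / len(synthetic))
    return quality, len(dataset)

quality, n_kept = self_improvement_loop()
print(quality, n_kept)
```

The key design point mirrored here is that only validated synthetic data re-enters training, so each round's dataset is filtered before it can influence the next generator.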