Learning Additively Compositional Latent Actions for Embodied AI

arXiv cs.CV / 4/7/2026


Key Points

  • The paper addresses limitations of prior latent-action-learning methods for embodied AI, which often lack priors for the additive, compositional structure of physical motion.
  • It introduces AC-LAM (Additively Compositional Latent Action Model), enforcing scene-wise additive composition constraints over short horizons in the latent action space.
  • The method promotes simple algebraic properties in latent actions—such as identity, inverse, and cycle consistency—while suppressing latent information that does not compose additively.
  • Experiments show that AC-LAM produces more structured, motion-specific, and displacement-calibrated latent actions, improving supervision for downstream policy learning.
  • The authors report state-of-the-art performance across both simulated and real-world tabletop tasks using the learned latent actions.

Abstract

Latent action learning infers pseudo-action labels from visual transitions, providing a way to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce the Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
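The algebraic properties named in the abstract can be illustrated with a small sketch. The encoder, loss terms, and function names below are illustrative assumptions, not the paper's actual implementation: given a latent-action encoder over observation pairs, the additive-composition constraints penalize deviations from additivity over a two-step horizon, a zero latent for the identity transition, a negated latent for the reversed transition, and a cancelling round trip.

```python
import numpy as np

def ac_losses(encoder, o0, o1, o2):
    """Hypothetical AC-style constraint losses over a 3-frame window.

    `encoder(obs_a, obs_b)` is assumed to return the latent action for the
    transition obs_a -> obs_b as a NumPy array.
    """
    z01 = encoder(o0, o1)
    z12 = encoder(o1, o2)
    z02 = encoder(o0, o2)
    z00 = encoder(o0, o0)  # identity transition: no motion
    z10 = encoder(o1, o0)  # reversed transition
    z20 = encoder(o2, o0)  # closes the 3-step cycle

    # Additivity: z(0->2) should equal z(0->1) + z(1->2).
    additivity = np.mean((z02 - (z01 + z12)) ** 2)
    # Identity: z(t->t) should be the zero latent.
    identity = np.mean(z00 ** 2)
    # Inverse: z(1->0) should negate z(0->1).
    inverse = np.mean((z10 + z01) ** 2)
    # Cycle consistency: a closed loop of transitions should sum to zero.
    cycle = np.mean((z01 + z12 + z20) ** 2)
    return additivity, identity, inverse, cycle
```

With a toy encoder that returns the raw observation displacement, `z = obs_b - obs_a`, all four losses vanish by construction; a trained encoder would instead be penalized toward this structure while discarding information that does not compose additively.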