From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

arXiv cs.RO / 4/8/2026


Key Points

  • The paper surveys techniques for learning robot manipulation control interfaces from temporal video without requiring action labels, aiming to bridge video observations and reliable robotic control.
  • It proposes an interface-centric taxonomy that groups approaches into three families: direct video-to-action policies (implicit interfaces), latent-action methods (video mapped through compact learned intermediates), and explicit visual interfaces (predicting interpretable targets for downstream control); a rough sketch of these families appears after this list.
  • For each family, the authors analyze its control-integration properties: how the control loop is closed, what can be verified before execution, and where failures enter.
  • The cross-family synthesis identifies the main open problem as the robotics integration layer, i.e., the mechanisms that connect video-derived predictions to dependable robot behavior.
  • The paper outlines research directions to close the gap between learned interfaces from video and robust, verifiable execution on robots.
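
To make the taxonomy concrete, the sketch below casts the three interface families as minimal Python signatures. It is an illustrative abstraction of the survey's categories, not code from the paper: all class names, method names, and input/output types are hypothetical.

```python
# Illustrative sketch of the three interface families (hypothetical names and types).
from abc import ABC, abstractmethod
from typing import Sequence
import numpy as np


class DirectVideoActionPolicy(ABC):
    """Implicit interface: maps a video observation history straight to a robot action."""

    @abstractmethod
    def act(self, frames: Sequence[np.ndarray]) -> np.ndarray:
        """Return a low-level command (e.g., an end-effector delta) from raw frames."""


class LatentActionPolicy(ABC):
    """Latent interface: routes temporal structure through a compact learned intermediate."""

    @abstractmethod
    def encode_latent_action(self, frames: Sequence[np.ndarray]) -> np.ndarray:
        """Infer a latent action summarizing the transition between frames."""

    @abstractmethod
    def decode_to_robot_action(self, latent_action: np.ndarray) -> np.ndarray:
        """Ground the latent action into an executable robot command."""


class ExplicitVisualInterface(ABC):
    """Explicit interface: predicts interpretable targets consumed by a downstream controller."""

    @abstractmethod
    def predict_targets(self, frames: Sequence[np.ndarray]) -> dict:
        """Return interpretable quantities (e.g., keypoints, flow, or goal images)."""
```

The distinction the taxonomy draws is where the video-to-control interface lives: inside the policy itself (implicit), in a learned latent space that must later be grounded, or in interpretable visual quantities handed to a separate controller.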

Abstract

Video is a scalable observation of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction -- all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. We introduce an interface-centric taxonomy organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video-action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit visual interfaces, which predict interpretable targets for downstream control. For each family, we analyze control-integration properties -- how the loop is closed, what can be verified before execution, and where failures enter. A cross-family synthesis reveals that the most pressing open challenges center on the robotics integration layer -- the mechanisms that connect video-derived predictions to dependable robot behavior -- and we outline research directions toward closing this gap.