From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
arXiv cs.RO / 4/8/2026
Key Points
- The paper surveys techniques for learning robot manipulation control interfaces from temporal video without requiring action labels, aiming to bridge video observations and reliable robotic control.
- It proposes an interface-centric taxonomy that groups approaches into three families: direct video-to-action policies (implicit interfaces), latent-action methods (video mapped through compact learned intermediates), and explicit visual interfaces (predicting interpretable targets for downstream control); see the sketch after this list.
- For each family, the authors analyze how control is integrated in practice: how the control loop is closed, what can be verified before execution, and where failures typically occur.
- The cross-family synthesis identifies the robotics integration layer as the main open problem: the mechanisms that connect video-derived predictions to dependable robot behavior.
- The paper outlines research directions to close the gap between learned interfaces from video and robust, verifiable execution on robots.
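To make the taxonomy concrete, here is a minimal Python sketch of the three interface families. All class, method, and field names (`VideoToActionPolicy`, `LatentActionModel`, `VisualTarget`, etc.) are illustrative assumptions, not APIs from the paper or any surveyed system.

```python
# A minimal sketch of the survey's three interface families.
# All names here are hypothetical illustrations, not APIs from
# the paper or any surveyed system.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np

Frame = np.ndarray   # an H x W x 3 video frame
Action = np.ndarray  # e.g. a 7-DoF end-effector command


class VideoToActionPolicy(ABC):
    """Implicit interface: frames map directly to actions.

    There is no intermediate artifact to inspect, so nothing can be
    verified before the robot executes the command.
    """

    @abstractmethod
    def act(self, frames: Sequence[Frame]) -> Action:
        ...


class LatentActionModel(ABC):
    """Latent interface: video passes through a compact learned intermediate."""

    @abstractmethod
    def encode(self, frames: Sequence[Frame]) -> np.ndarray:
        """Infer a latent action from consecutive frames (trained without action labels)."""

    @abstractmethod
    def decode(self, latent: np.ndarray) -> Action:
        """Ground the latent in the robot's own action space."""


@dataclass
class VisualTarget:
    """Explicit interface: an interpretable prediction (goal image, keypoints,
    or a trajectory) that a downstream controller consumes and that can be
    checked before execution."""
    goal_image: Optional[Frame] = None
    keypoints: Optional[np.ndarray] = None           # N x 2 pixel coordinates
    end_effector_path: Optional[np.ndarray] = None   # T x 3 waypoints
```

The verifiability contrast the survey draws falls out of the types: the implicit interface exposes nothing to check before execution, the latent interface exposes an intermediate that is compact but not human-readable, and the explicit interface yields targets that a downstream controller (or a human) can inspect first.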