ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration
arXiv cs.CV / 4/2/2026
Key Points
- The paper introduces ONE-SHOT, a parameter-efficient framework for compositional human–environment video synthesis that targets fine-grained, independent editing of subjects and scenes.
- It factorizes generation into disentangled signals via canonical-space motion injection, using cross-attention to decouple human dynamics from environmental cues (a minimal sketch of this pattern follows the list).
- It proposes Dynamic-Grounded-RoPE, a new positional embedding method designed to establish spatial correspondences across domains without relying on heuristic 3D alignment (see the rotary-embedding sketch below).
- For long-horizon (minute-level) generation, it adds a Hybrid Context Integration mechanism that preserves consistency between the subject and the overall scene (see the context-assembly sketch below).
- The authors report significant improvements over state-of-the-art video foundation models, with better structural control while maintaining creative diversity.
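
The motion-injection bullet points to a cross-attention pathway that reads canonical-space human-motion tokens into the scene stream. Below is a minimal PyTorch sketch of that general pattern; the module name, shapes, and residual design are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MotionInjectionBlock(nn.Module):
    """Hypothetical block: scene tokens query canonical-space motion tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Cross-attention: queries come from the scene/video stream, keys and
        # values from the motion stream, so human dynamics are injected
        # without overwriting environmental features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, scene_tokens: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
        # scene_tokens: (B, N_scene, dim); motion_tokens: (B, N_motion, dim)
        q = self.norm(scene_tokens)
        injected, _ = self.cross_attn(q, motion_tokens, motion_tokens)
        return scene_tokens + injected  # residual keeps scene content intact
```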
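
The Dynamic-Grounded-RoPE bullet describes grounding positional embeddings in a shared coordinate frame across domains. The paper's exact formulation is not reproduced here; the sketch below only shows the underlying ingredient, standard rotary position embeddings evaluated at arbitrary continuous per-token positions rather than a fixed integer grid.

```python
import torch

def rope_at_positions(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to x at continuous, possibly shared, positions.

    x: (..., N, dim) with even dim; positions: (..., N) coordinates that two
    token streams could share, which is what creates correspondences.
    """
    dim = x.shape[-1]
    freqs = base ** (-torch.arange(0, dim, 2, device=x.device, dtype=x.dtype) / dim)
    angles = positions[..., None].to(x.dtype) * freqs  # (..., N, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Two streams embedded with the same `positions` tensor attend to each other as if co-located, which is one way spatial correspondence can emerge without explicit 3D alignment.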
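
The long-horizon bullet does not spell out the mechanism, so the following is only one plausible reading of a hybrid context: condition each new chunk on a few globally spaced anchor frames (preserving subject and scene identity) plus a sliding window of recent frames (preserving motion continuity). All numbers and the index scheme are assumptions.

```python
import torch

def hybrid_context(frames: torch.Tensor, num_anchors: int = 4, recent: int = 8) -> torch.Tensor:
    """Assemble a conditioning set from anchor frames plus recent frames.

    frames: (T, C, H, W) frames generated so far; returns the subset used
    to condition the next chunk in a hypothetical autoregressive loop.
    """
    T = frames.shape[0]
    # Evenly spaced anchors over the older history for global consistency.
    anchor_idx = torch.linspace(0, max(T - recent - 1, 0), num_anchors).long()
    # Sliding window over the newest frames for local temporal continuity.
    recent_idx = torch.arange(max(T - recent, 0), T)
    idx = torch.unique(torch.cat([anchor_idx, recent_idx]))
    return frames[idx]
```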