UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors
arXiv cs.CV / 3/18/2026
Key Points
- UMO provides a unified framework that casts diverse downstream motion generation tasks into compositions of per-frame operations to leverage pretrained motion foundation models.
- It introduces three learnable frame-level meta-operation embeddings and a lightweight temporal fusion method to inject in-context cues with negligible runtime overhead.
- By fine-tuning pretrained DiT-based motion foundation models, UMO supports previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation.
- Experimental results show UMO consistently outperforms task-specific and training-free baselines across benchmarks.
- The authors will publicly release the code and model, along with a project page, for follow-up use and evaluation.
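The frame-level conditioning described above can be sketched in code. The following is a minimal, hypothetical illustration (not the paper's implementation): it assumes three learnable meta-operation embeddings indexed per frame, fused temporally by a cheap depthwise 1-D convolution and added to the frame features. All names (`MetaOpInjector`, the keep/edit/generate labels, the conv-based fusion) are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class MetaOpInjector(nn.Module):
    """Hypothetical sketch of per-frame meta-operation conditioning.

    Three learnable embeddings tag each frame with an operation
    (e.g. 0 = keep, 1 = edit, 2 = generate -- labels assumed), and a
    lightweight depthwise temporal convolution fuses neighbouring cues
    before they are added to the frame features, keeping runtime
    overhead small.
    """

    def __init__(self, dim: int, num_ops: int = 3, kernel: int = 3):
        super().__init__()
        # Three frame-level meta-operation embeddings.
        self.op_embed = nn.Embedding(num_ops, dim)
        # Depthwise 1-D conv over time: a cheap temporal fusion.
        self.fuse = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)

    def forward(self, x: torch.Tensor, op_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frame features; op_ids: (B, T) per-frame op indices.
        cues = self.op_embed(op_ids)                      # (B, T, D)
        cues = self.fuse(cues.transpose(1, 2)).transpose(1, 2)
        return x + cues                                   # inject in-context cues

# Toy usage: a batch of 2 sequences, 16 frames, 64-dim features.
B, T, D = 2, 16, 64
inj = MetaOpInjector(D)
out = inj(torch.randn(B, T, D), torch.randint(0, 3, (B, T)))
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the conditioning enters as an additive, per-frame signal, a pretrained backbone can be fine-tuned with it while leaving the rest of the architecture unchanged, which is consistent with the "negligible runtime overhead" claim above.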
