UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors
arXiv cs.CV / 3/18/2026
Key Points
- UMO provides a unified framework that casts diverse downstream motion generation tasks as compositions of per-frame operations, so a single pretrained motion foundation model can serve them all.
- It introduces three learnable frame-level meta-operation embeddings and a lightweight temporal fusion method that inject in-context cues with negligible runtime overhead (a minimal sketch of this mechanism follows the list).
- By finetuning pretrained DiT-based motion LFMs, UMO enables tasks these models previously could not handle, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation.
- Experimental results show UMO consistently outperforms task-specific and training-free baselines across benchmarks.
- The authors plan to publicly release the code and models, along with a project page, for follow-up use and evaluation.
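The summary only describes the mechanism at a high level, so the following is a minimal sketch of how frame-level meta-operation embeddings and a lightweight temporal fusion layer could be wired together, not the authors' released implementation. Every name here (`MetaOpInjector`, `temporal_fuse`, the keep/edit/generate semantics of the three operation ids, and the choice of a depthwise temporal convolution for fusion) is an illustrative assumption.

```python
# Hypothetical sketch of UMO-style frame-level conditioning: each frame of a
# motion token sequence is tagged with one of three learnable meta-operation
# embeddings, and a lightweight temporal layer fuses these in-context cues
# into the tokens before they enter a pretrained DiT-based motion backbone.
import torch
import torch.nn as nn

class MetaOpInjector(nn.Module):
    NUM_META_OPS = 3  # per the paper: three frame-level meta-operations

    def __init__(self, dim: int):
        super().__init__()
        # One learnable embedding vector per meta-operation.
        self.meta_ops = nn.Embedding(self.NUM_META_OPS, dim)
        # "Lightweight temporal fusion" realized here as a depthwise temporal
        # convolution (an assumption; the paper does not specify the operator).
        self.temporal_fuse = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, motion_tokens: torch.Tensor, op_ids: torch.Tensor) -> torch.Tensor:
        # motion_tokens: (batch, frames, dim); op_ids: (batch, frames) in {0, 1, 2}
        x = motion_tokens + self.meta_ops(op_ids)  # inject per-frame cues
        # Residual temporal mixing so cues propagate across neighboring frames.
        x = x + self.temporal_fuse(x.transpose(1, 2)).transpose(1, 2)
        return x  # ready to feed into the pretrained DiT motion backbone

# Example: mark the middle third of a sequence for temporal inpainting
# (the id-to-meaning mapping 0 = keep, 2 = generate is assumed).
B, T, D = 2, 60, 256
injector = MetaOpInjector(D)
op_ids = torch.zeros(B, T, dtype=torch.long)  # 0 = keep observed frames
op_ids[:, 20:40] = 2                          # 2 = generate missing frames
tokens = injector(torch.randn(B, T, D), op_ids)
```

A residual design like this keeps the runtime cost of the injector small relative to the backbone, which is consistent with the "negligible runtime overhead" claim, though the actual fusion operator may differ.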