HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching

arXiv cs.RO / 4/14/2026



Key Points

HO-Flow is a new framework for generating realistic 3D hand–object interaction (HOI) motion sequences from text and canonical 3D objects, targeting temporal coherence and physical plausibility.
The method first uses an interaction-aware variational autoencoder to map hand and object motion sequences into a unified latent space by incorporating hand/object kinematics to better capture interaction dynamics.
It then applies a masked flow matching model that blends auto-regressive temporal reasoning with continuous latent generation to improve temporal consistency across frames.
To enhance generalization beyond training data, HO-Flow predicts object motion relative to the initial frame, enabling effective pre-training on large-scale synthetic datasets.
Experiments on GRAB, OakInk, and DexYCB show state-of-the-art results, improving both physical plausibility and motion diversity for interaction synthesis.
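
The idea of predicting object motion relative to the initial frame can be illustrated with a small sketch. It assumes object poses are stored as 4x4 homogeneous transforms in a world frame. That representation, and the helper names `to_relative_poses`, `to_world_poses`, and `make_pose`, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def to_relative_poses(world_poses):
    """Express each pose relative to the first frame.

    world_poses: array of shape (T, 4, 4), homogeneous object poses in a
    world frame (an assumed representation, not the paper's).
    """
    world_poses = np.asarray(world_poses, dtype=float)
    first_inv = np.linalg.inv(world_poses[0])
    # (4, 4) @ (T, 4, 4) broadcasts over the time axis.
    return first_inv @ world_poses

def to_world_poses(relative_poses, first_pose):
    """Invert to_relative_poses given the initial world pose."""
    return np.asarray(first_pose, dtype=float) @ np.asarray(relative_poses, dtype=float)

def make_pose(angle, translation):
    """Hypothetical helper: rotation about z by `angle` plus a translation."""
    c, s = np.cos(angle), np.sin(angle)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

# A toy 5-frame trajectory: the object rotates and slides along x.
world = np.stack([make_pose(0.1 * k, [float(k), 2.0, 0.5]) for k in range(5)])
relative = to_relative_poses(world)
```

With this parameterization the first relative pose is always the identity, so a sequence no longer depends on where the object happened to start in the scene. That is one plausible reading of why a relative formulation would make it easier to mix data from different sources, such as synthetic pre-training sets.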

Abstract

Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and to perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from text and canonical 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.
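
As a rough illustration of what a masked flow matching objective over latent tokens can look like, the sketch below uses the common linear-path formulation of conditional flow matching: a noisy point is interpolated between Gaussian noise and the clean latent, and the model regresses the constant velocity between them. The abstract does not specify HO-Flow's actual loss, masking scheme, or conditioning, so every detail here is an assumption, including the function name `masked_flow_matching_loss`, the `velocity_fn` interface, and the omission of text and object conditioning.

```python
import numpy as np

def masked_flow_matching_loss(velocity_fn, latents, mask, rng):
    """Generic masked flow matching loss (illustrative, not the paper's).

    velocity_fn: hypothetical model, called as velocity_fn(x, t, mask)
                 and returning an array shaped like x.
    latents:     (T, D) clean latent tokens, e.g. from a VAE encoder.
    mask:        (T,) bool, True for tokens the model must generate.
    rng:         numpy random Generator.
    """
    noise = rng.standard_normal(latents.shape)  # sample from the prior
    t = rng.uniform()                           # flow time in [0, 1)
    # Linear path between noise (t = 0) and data (t = 1).
    x_t = (1.0 - t) * noise + t * latents
    # Unmasked tokens stay clean so they can act as temporal context.
    x_in = np.where(mask[:, None], x_t, latents)
    target = latents - noise                    # velocity of the linear path
    pred = velocity_fn(x_in, t, mask)
    per_token = ((pred - target) ** 2).mean(axis=1)
    # Average the error over masked tokens only.
    return float((per_token * mask).sum() / max(mask.sum(), 1))

rng = np.random.default_rng(0)
latents = rng.standard_normal((6, 4))
# Earlier frames serve as context, later frames are generated.
mask = np.array([False, False, False, True, True, True])

# A trivial stand-in model that always predicts zero velocity.
zero_model = lambda x, t, m: np.zeros_like(x)
loss = masked_flow_matching_loss(zero_model, latents, mask, rng)
```

Masking a contiguous block of later frames, as in the toy `mask` above, is one simple way to mimic auto-regressive behavior: known earlier tokens condition the generation of later ones. Whether HO-Flow masks this way is not stated in the abstract.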


