Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

arXiv cs.CV / 5/1/2026



Key Points

The paper proposes Uni-HOI, a unified framework to model the joint distribution of text, human motion, and object motion for 4D human-object interaction (HOI).
Uni-HOI uses large language models (LLMs) together with two motion-specific VQ-VAE modules to convert heterogeneous motion data into token sequences that can be fed into LLMs.
It introduces a two-stage training approach: first multi-task learning on a large-scale HOI dataset to learn cross-modal correlations, then task-specific fine-tuning for better accuracy.
Experiments indicate Uni-HOI can handle multiple HOI-related tasks within one system: text-driven HOI generation, human motion generation from object motion (with or without accompanying text), and object motion prediction from human motion.
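As a rough illustration of the tokenization idea in the points above, the sketch below shows only the quantization step of a VQ-VAE: each motion feature vector is replaced by the index of its nearest codebook entry, and the indices are shifted past the text vocabulary so that text, human-motion, and object-motion tokens can share one ID space. The learned encoder and decoder are omitted, and the codebooks, feature dimensions, and `TEXT_VOCAB_SIZE` are hypothetical values chosen for readability, not details taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). This is the lookup step of a VQ-VAE;
    the encoder that would produce these features is not modeled here."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def shift_ids(codes: Sequence[int], offset: int) -> List[int]:
    """Shift codebook indices into a reserved range of a shared vocabulary."""
    return [offset + c for c in codes]


# Hypothetical sizes and codebooks (assumptions for illustration only).
TEXT_VOCAB_SIZE = 32000
human_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_codebook = [[0.0], [0.5], [1.0]]

# Toy "encoded" motion features, one vector per (downsampled) time step.
human_frames = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]
object_frames = [[0.05], [0.6], [0.95]]

human_codes = quantize(human_frames, human_codebook)
object_codes = quantize(object_frames, object_codebook)

# Human tokens occupy IDs right after the text vocabulary,
# object tokens right after the human codebook.
human_ids = shift_ids(human_codes, TEXT_VOCAB_SIZE)
object_ids = shift_ids(object_codes, TEXT_VOCAB_SIZE + len(human_codebook))

print(human_ids)
print(object_ids)
```

Keeping two separate codebooks mirrors the two motion-specific VQ-VAEs mentioned above. The non-overlapping ID ranges are one simple way to let a single language model distinguish the modalities.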

Abstract

Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology powering virtual and mixed-reality applications. While existing works have achieved promising results on specific HOI tasks, such as text-conditioned HOI generation and human motion generation from object motion, they typically rely on task-specific architectures and lack a unified framework capable of handling diverse conditional inputs. To address this gap, we propose Uni-HOI, a unified framework that learns the joint distribution among text, human motion, and object motion. By leveraging large language models (LLMs) and two motion-specific vector quantized variational autoencoders (VQ-VAEs), we convert heterogeneous motion data into token sequences compatible with LLM inputs, enabling seamless integration and joint modeling of all three modalities. We introduce a two-stage training strategy: the first stage performs multi-task learning on a large-scale HOI dataset to capture the underlying correlations among the three modalities, while the second stage fine-tunes the model on specific tasks to further enhance performance. Extensive experiments demonstrate that Uni-HOI achieves strong performance on multiple HOI-related tasks within a unified framework, including text-driven HOI generation, object motion-driven human motion generation (optionally with text), and human motion-driven object motion prediction.
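One way to picture the multi-task first stage is that a single (text, human motion, object motion) triple yields several condition/target pairs, one per task named in the abstract. The sketch below is a guess at such a layout, not the paper's actual prompt format: the task names, the `<text>`/`<human>`/`<object>` boundary markers, the placeholder motion tokens, and the choice to concatenate human and object tokens in the text-to-HOI target are all hypothetical.

```python
from typing import Dict, List, Tuple

# task name -> (conditioning modalities, target modalities); names are hypothetical.
TASKS: Dict[str, Tuple[Tuple[str, ...], Tuple[str, ...]]] = {
    "text_to_hoi": (("text",), ("human", "object")),
    "object_to_human": (("object",), ("human",)),
    "text_object_to_human": (("text", "object"), ("human",)),
    "human_to_object": (("human",), ("object",)),
}


def wrap(modality: str, tokens: List[str]) -> List[str]:
    """Surround a token run with (assumed) modality boundary markers."""
    return [f"<{modality}>"] + tokens + [f"</{modality}>"]


def build_examples(sample: Dict[str, List[str]]) -> Dict[str, Tuple[List[str], List[str]]]:
    """Turn one triple into a (prompt, target) pair for every task."""
    examples = {}
    for name, (conditions, targets) in TASKS.items():
        prompt = [tok for m in conditions for tok in wrap(m, sample[m])]
        target = [tok for m in targets for tok in wrap(m, sample[m])]
        examples[name] = (prompt, target)
    return examples


# Toy triple; the motion tokens stand in for VQ-VAE codebook indices.
sample = {
    "text": ["a", "person", "lifts", "a", "box"],
    "human": ["<h_12>", "<h_7>"],
    "object": ["<o_3>", "<o_3>"],
}

examples = build_examples(sample)
for name, (prompt, target) in examples.items():
    print(name, prompt, "->", target)
```

Under this layout, the second stage described in the abstract would correspond to continuing training on the pairs from a single entry of `TASKS`.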


