Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model
arXiv cs.CV · April 28, 2026
📰 News · Models & Research
Key Points
- The paper studies whether facial-expression-derived emotion embeddings can improve short-horizon human pose prediction, especially for emotion-driven motion dynamics that geometric cues alone may miss.
- It proposes a lightweight autoregressive “predictive world model” that fuses pose keypoints and emotion embeddings through a learnable gating mechanism and rolls out 15-step pose forecasts with a two-layer LSTM.
- Experiments on two small pose–emotion video datasets (one controlled, one natural with larger facial-expression changes) find that naive multimodal fusion does not reliably improve accuracy, while normalized gating fusion significantly improves performance on emotion-driven sequences.
- Counterfactual perturbation tests show that the predicted pose trajectory changes measurably when the multimodal inputs are altered, indicating that the emotion embeddings act as a meaningful auxiliary conditioning signal rather than redundant information.
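The gated fusion and autoregressive rollout described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all dimensions, weight matrices, and function names are hypothetical, the weights are random rather than learned, and a simple linear residual readout stands in for the two-layer LSTM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D_POSE, D_EMO, D_HID = 34, 8, 32   # hypothetical dimensions

# Hypothetical "learned" parameters (randomly initialized here).
W_gate = rng.normal(scale=0.1, size=(D_HID, D_POSE + D_EMO))
W_pose = rng.normal(scale=0.1, size=(D_HID, D_POSE))
W_emo  = rng.normal(scale=0.1, size=(D_HID, D_EMO))
W_out  = rng.normal(scale=0.1, size=(D_POSE, D_HID))

def gated_fuse(pose, emo):
    """Gating fusion: a learned gate in (0, 1) decides, per hidden
    feature, how much the emotion embedding contributes."""
    g = sigmoid(W_gate @ np.concatenate([pose, emo]))
    return g * (W_pose @ pose) + (1.0 - g) * (W_emo @ emo)

def rollout(pose0, emo, steps=15):
    """Autoregressive 15-step forecast: each predicted pose is fed
    back as input to the next step (a linear residual update stands
    in for the paper's two-layer LSTM)."""
    pose, traj = pose0, []
    for _ in range(steps):
        pose = pose + W_out @ gated_fuse(pose, emo)
        traj.append(pose)
    return np.stack(traj)

pose0 = rng.normal(size=D_POSE)   # e.g. 17 keypoints x (x, y)
emo   = rng.normal(size=D_EMO)    # facial-expression embedding
traj  = rollout(pose0, emo)       # shape: (15, D_POSE)

# Counterfactual perturbation test: altering the emotion input should
# measurably change the predicted trajectory if the signal is used.
traj_cf = rollout(pose0, emo + rng.normal(size=D_EMO))
print(traj.shape, float(np.linalg.norm(traj - traj_cf)))
```

The counterfactual check at the end mirrors the paper's probe: if the trajectories coincide regardless of the emotion input, the model has learned to ignore that modality.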
Related Articles
- Behind the Scenes of a Self-Evolving AI: The Architecture of Tian AI (Dev.to)
- Abliterlitics: Benchmarks and Tensor Comparison for Heretic, Abliterlix, Huiui, HauhauCS for GLM 4.7 Flash (Reddit r/LocalLLaMA)
- Record $1.1B Seed Funding for Reinforcement Learning Startup (AI Business)
- The One Substrate Failure Behind Every AI System in 2026 (Reddit r/artificial)
- Into the Omniverse: Manufacturing’s Simulation-First Era Has Arrived (Nvidia AI Blog)