Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

arXiv cs.RO · April 14, 2026

Key Points

  • The paper proposes a lightweight transformer for robot co-speech gesture generation that uses text and emotion to predict iconic gesture placement and intensity.
  • Unlike many data-driven approaches that rely on rhythmic, beat-like motion or audio, the method requires no audio input during inference.
  • The model is evaluated on the BEAT2 dataset and is reported to outperform GPT-4o on both semantic gesture placement classification and intensity regression.
  • The authors emphasize that the approach is computationally compact, making it suitable for real-time deployment on embodied agents.

Abstract

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic, beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.
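
To make the setup concrete, below is a minimal PyTorch sketch of a dual-head transformer of the kind the abstract describes: token-level placement classification plus intensity regression, conditioned on text and an emotion label, with no audio input. Everything here is an assumption for illustration (the class name GesturePredictor, the dimensions, the BERT-sized vocabulary, and the additive emotion injection); the paper's actual architecture and training details are not specified in this summary.

```python
# Hypothetical sketch of a lightweight dual-head transformer for iconic
# gesture prediction from text + emotion. Names, sizes, and the emotion
# conditioning scheme are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class GesturePredictor(nn.Module):
    def __init__(self, vocab_size=30522,  # assumed BERT-sized vocabulary
                 n_emotions=8, d_model=128, n_heads=4, n_layers=2,
                 max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Emotion is embedded once and added to every token position.
        self.emo_emb = nn.Embedding(n_emotions, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Head 1: per-token classification -- does an iconic gesture
        # land on this word? (logits over {no gesture, gesture})
        self.placement_head = nn.Linear(d_model, 2)
        # Head 2: per-token scalar regression -- gesture intensity.
        self.intensity_head = nn.Linear(d_model, 1)

    def forward(self, token_ids, emotion_id):
        b, t = token_ids.shape
        pos = torch.arange(t, device=token_ids.device).expand(b, t)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        x = x + self.emo_emb(emotion_id).unsqueeze(1)  # broadcast emotion
        h = self.encoder(x)
        return self.placement_head(h), self.intensity_head(h).squeeze(-1)

# Usage: a batch of 2 sentences of 10 tokens, each with an emotion label.
model = GesturePredictor()
tokens = torch.randint(0, 30522, (2, 10))
emotion = torch.tensor([3, 5])
placement_logits, intensity = model(tokens, emotion)
print(placement_logits.shape, intensity.shape)  # (2, 10, 2) (2, 10)
```

At roughly two encoder layers and a 128-dimensional hidden state, a model of this shape has on the order of a few million parameters, which is consistent with the paper's claim of a computationally compact model fit for real-time use on embodied agents.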