Multimodal Anomaly Detection for Human-Robot Interaction

arXiv cs.RO / 4/13/2026


Key Points

  • The paper introduces MADRI, a reconstruction-based anomaly detection framework for human-robot interaction that first converts video streams into semantically meaningful feature vectors.
  • It extends vision-only anomaly detection by fusing visual features with the robot’s internal sensor readings and a Scene Graph to capture both external environmental deviations and internal robot failures.
  • The authors created a custom dataset for a pick-and-place task with both normal and anomalous conditions to evaluate the approach.
  • Results show that reconstructing vision-derived feature vectors alone can effectively detect anomalies, and that fusing in the additional modalities further improves detection performance.

Abstract

Ensuring safety and reliability in human-robot interaction (HRI) requires the timely detection of unexpected events that could lead to system failures or unsafe behaviours. Anomaly detection thus plays a critical role in enabling robots to recognize and respond to deviations from normal operation during collaborative tasks. While reconstruction models have been actively explored in HRI, approaches that operate directly on feature vectors remain largely unexplored. In this work, we propose MADRI, a framework that first transforms video streams into semantically meaningful feature vectors before performing reconstruction-based anomaly detection. Additionally, we augment these visual feature vectors with readings from the robot's internal sensors and a Scene Graph, enabling the model to capture both external anomalies in the visual environment and internal failures within the robot itself. To evaluate our approach, we collected a custom dataset consisting of a simple pick-and-place robotic task under normal and anomalous conditions. Experimental results demonstrate that reconstruction on vision-based feature vectors alone is effective for detecting anomalies, while incorporating other modalities further improves detection performance, highlighting the benefits of multimodal feature reconstruction for robust anomaly detection in human-robot collaboration.
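To make the core idea concrete, the sketch below shows reconstruction-based anomaly detection on fused feature vectors in its simplest form: fit a reconstruction model on normal data only, then flag samples whose reconstruction error exceeds a threshold calibrated on the normal set. This is a minimal illustration using a linear (PCA) reconstructor in place of a learned autoencoder; the `fuse` function and all feature names are hypothetical, since the paper's actual architecture and modalities are not detailed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(vision, sensors, scene_graph):
    # Hypothetical fusion: concatenate vision features, robot sensor
    # readings, and a flattened scene-graph embedding into one vector.
    return np.concatenate([vision, sensors, scene_graph])

def fit_pca(X, k):
    # Fit a rank-k linear reconstructor on normal data only.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruction_error(x, mean, components):
    z = (x - mean) @ components.T        # encode into the subspace
    x_hat = mean + z @ components        # decode back to feature space
    return float(np.linalg.norm(x - x_hat))

# Synthetic "normal" fused features lying near a 3-dim subspace of R^16.
X_normal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 16))
mean, comps = fit_pca(X_normal, k=3)

# Calibrate the anomaly threshold from errors on normal data.
errors = [reconstruction_error(x, mean, comps) for x in X_normal]
threshold = max(errors)

# A sample far from the normal subspace reconstructs poorly.
x_anomalous = rng.normal(size=16) * 5.0
is_anomaly = reconstruction_error(x_anomalous, mean, comps) > threshold
print(is_anomaly)
```

In practice the linear reconstructor would be replaced by the learned model, and thresholding could use a percentile of the normal-error distribution rather than its maximum; the train-on-normal-only, score-by-reconstruction-error pattern is what the abstract describes.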