A Multimodal Data Collection Framework for Dialogue-Driven Assistive Robotics to Clarify Ambiguities: A Wizard-of-Oz Pilot Study

arXiv cs.RO / 4/17/2026


Key Points

  • The paper addresses a key limitation in assistive robotics: current interfaces and datasets do not adequately capture multimodal, dialogue-driven ambiguity in natural human-robot interaction.
  • It proposes a multimodal data collection framework using a dialogue-based protocol and a two-room Wizard-of-Oz setup to simulate robot autonomy while encouraging natural user behavior.
  • The system records five synchronized modalities (RGB-D video, conversational audio, IMU signals, end-effector Cartesian pose, and whole-body joint states) across five wheelchair/robot-arm assistance tasks.
  • A pilot dataset of 53 trials from five participants was collected and evaluated using motion-smoothness analysis and user feedback, showing the approach can represent diverse ambiguity types.
  • The authors argue the framework is suitable for scaling to larger datasets to support training, benchmarking, and evaluation of ambiguity-aware assistive control.
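To make "synchronized modalities" concrete, the sketch below shows one common way to align sensor streams recorded at different rates: nearest-timestamp matching against a reference stream. It is an illustrative assumption, not the authors' pipeline. The summary does not describe how synchronization is implemented, and the names `nearest_index`, `align_streams`, and `max_gap` are hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the timestamp closest to t.

    Assumes `timestamps` is non-empty and sorted in ascending order.
    """
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose the closer neighbour; ties go to the earlier sample.
    if t - timestamps[i - 1] <= timestamps[i] - t:
        return i - 1
    return i


def align_streams(reference, others, max_gap):
    """Match every reference timestamp to the nearest sample of each stream.

    reference: sorted timestamps (seconds) of the reference stream,
               for example the RGB-D frames (hypothetical choice).
    others:    dict mapping a stream name to its sorted timestamps.
    max_gap:   largest time difference (seconds) accepted as a match.

    Returns one dict per reference timestamp, mapping each stream name
    to the matched sample index, or None when no sample is close enough.
    """
    aligned = []
    for t in reference:
        row = {}
        for name, ts in others.items():
            if not ts:
                row[name] = None
                continue
            j = nearest_index(ts, t)
            row[name] = j if abs(ts[j] - t) <= max_gap else None
        aligned.append(row)
    return aligned
```

In this sketch a missing match is reported as `None` rather than filled by interpolation, so downstream code can decide whether to drop or interpolate the affected frames.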

Abstract

Integrated control of wheelchairs and wheelchair-mounted robotic arms (WMRAs) has strong potential to increase independence for users with severe motor limitations, yet existing interfaces often lack the flexibility needed for intuitive assistive interaction. Although data-driven AI methods show promise, progress is limited by the lack of multimodal datasets that capture natural Human-Robot Interaction (HRI), particularly conversational ambiguity in dialogue-driven control. To address this gap, we propose a multimodal data collection framework that employs a dialogue-based interaction protocol and a two-room Wizard-of-Oz (WoZ) setup to simulate robot autonomy while eliciting natural user behavior. The framework records five synchronized modalities across five assistive tasks: RGB-D video, conversational audio, inertial measurement unit (IMU) signals, end-effector Cartesian pose, and whole-body joint states. Using this framework, we collected a pilot dataset of 53 trials from five participants and validated its quality through motion smoothness analysis and user feedback. The results show that the framework effectively captures diverse ambiguity types and supports natural dialogue-driven interaction, demonstrating its suitability for scaling to a larger dataset for learning, benchmarking, and evaluation of ambiguity-aware assistive control.
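The abstract does not say which metric the motion smoothness analysis uses. As a hedged illustration only, the sketch below computes mean squared jerk, one widely used smoothness measure, from uniformly sampled end-effector positions using a third-order finite difference. The function name `mean_squared_jerk` and its inputs are assumptions for illustration and may differ from the authors' method.

```python
def mean_squared_jerk(positions, dt):
    """Estimate the mean squared jerk of a sampled 3D trajectory.

    positions: sequence of (x, y, z) tuples sampled at a fixed period.
    dt:        sampling period in seconds (must be positive).

    Lower values indicate smoother motion. This is an illustrative
    metric, not necessarily the one used in the paper.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")

    windows = len(positions) - 3
    total = 0.0
    for k in range(windows):
        p0, p1, p2, p3 = positions[k:k + 4]
        squared_norm = 0.0
        for a, b, c, d in zip(p0, p1, p2, p3):
            # Third-order forward difference approximates the third
            # time derivative of position (jerk) along one axis.
            jerk = (d - 3 * c + 3 * b - a) / dt ** 3
            squared_norm += jerk * jerk
        total += squared_norm
    return total / windows
```

A constant-velocity trajectory yields a value near zero under this measure, while abrupt corrections raise it. Because the value depends on movement duration and amplitude, comparisons across trials typically need normalization.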