Belief Dynamics for Detecting Behavioral Shifts in Safe Collaborative Manipulation

arXiv cs.LG / 4/8/2026


Key Points

  • The paper addresses how robots in shared workspaces can become unsafe when a collaborating agent switches behavioral strategy mid-episode and the robot continues under outdated assumptions.
  • In ManiSkill shared-workspace manipulation tasks, across 10 regime-switch detection methods, enabling detection cuts post-switch collisions by 52%, but reliability varies widely depending on the allowed detection tolerance.
  • Under a realistic tolerance of ±3 steps, detection performance ranges from 86% down to 30%, while with a looser ±5 tolerance all methods reach 100%, highlighting practical constraints for deployment.
  • The authors propose UA-TOM, a lightweight belief-tracking module that augments frozen vision-language-action (VLA) control backbones with selective state-space dynamics, causal attention, and prediction-error signals, achieving the highest detection rate among unassisted methods (85.7% at ±3 steps) and the lowest close-range time (4.8 steps), even outperforming an Oracle baseline on that metric (5.3 steps).
  • UA-TOM’s analysis shows regime switches cause a 17x increase in hidden-state update magnitude that decays over ~10 timesteps, with inference overhead of 7.4 ms (14.8% of a 50 ms control budget), and complementary behavior verified in a cross-domain Overcooked experiment.
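The prediction-error signal described above can be illustrated with a minimal sketch: track a running baseline of the policy's prediction error and flag a regime switch when the error spikes well above that baseline. This is a hypothetical toy detector for intuition only, not the paper's UA-TOM architecture; the function name, thresholds, and synthetic trace are all assumptions.

```python
import numpy as np

def detect_regime_switch(errors, alpha=0.05, k=4.0, warmup=10):
    """Return the first timestep where prediction error exceeds k times
    its running EMA baseline (toy detector; not the paper's method)."""
    baseline = None
    for t, e in enumerate(errors):
        if baseline is None:
            baseline = e          # initialize baseline from first error
            continue
        if t >= warmup and e > k * baseline:
            return t              # spike well above baseline -> flag switch
        baseline = (1 - alpha) * baseline + alpha * e  # update EMA
    return None                   # no switch detected

# Synthetic error trace: low error for 50 steps, then a sharp jump,
# mimicking the large post-switch update magnitudes the paper reports.
rng = np.random.default_rng(0)
errors = np.concatenate([rng.uniform(0.9, 1.1, 50),
                         rng.uniform(9.0, 11.0, 20)])
t_hat = detect_regime_switch(errors)
```

On this trace the detector fires at the step where the error distribution jumps, which is what a tolerance-window metric (e.g. ±3 steps) would score as a hit.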

Abstract

Robots operating in shared workspaces must maintain safe coordination with other agents whose behavior may change during task execution. When a collaborating agent switches strategy mid-episode, continuing under outdated assumptions can lead to unsafe actions and increased collision risk. Reliable detection of such behavioral regime changes is therefore critical. We study regime-switch detection under controlled non-stationarity in ManiSkill shared-workspace manipulation tasks. Across ten detection methods and five random seeds, enabling detection reduces post-switch collisions by 52%. However, average performance hides significant reliability differences: under a realistic tolerance of ±3 steps, detection ranges from 86% to 30%, while under ±5 steps all methods achieve 100%. We introduce UA-TOM, a lightweight belief-tracking module that augments frozen vision-language-action (VLA) control backbones using selective state-space dynamics, causal attention, and prediction-error signals. Across five seeds and 1200 episodes, UA-TOM achieves the highest detection rate among unassisted methods (85.7% at ±3) and the lowest close-range time (4.8 steps), outperforming an Oracle (5.3 steps). Analysis shows hidden-state update magnitude increases by 17x at regime switches and decays over roughly 10 timesteps, while the discretization step converges to a near-constant value (Δt ≈ 0.78), indicating sensitivity driven by learned dynamics rather than input-dependent gating. Cross-domain experiments in Overcooked show complementary roles of causal attention and prediction-error signals. UA-TOM introduces 7.4 ms inference overhead (14.8% of a 50 ms control budget), enabling reliable regime-switch detection without modifying the base policy.
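The tolerance-dependent detection rates quoted above (e.g. 86% at ±3 steps vs. 100% at ±5) can be made concrete with a small sketch of a tolerance-window metric: a detection counts as a hit only if it lands within ±tol steps of the true switch. The function name and the example episodes below are illustrative assumptions, not the paper's evaluation code.

```python
def detection_rate(detected, true_switches, tol=3):
    """Fraction of episodes whose detected switch step falls within
    +/- tol steps of the true switch step (illustrative metric;
    None means no detection in that episode)."""
    hits = sum(d is not None and abs(d - s) <= tol
               for d, s in zip(detected, true_switches))
    return hits / len(true_switches)

# Five hypothetical episodes, all with a true switch at step 50:
# one miss (None) and one late detection at step 54.
detected = [50, 52, None, 54, 60]
truths   = [50, 50, 50, 50, 50]

rate3 = detection_rate(detected, truths, tol=3)  # 54 and 60 fall outside +-3
rate5 = detection_rate(detected, truths, tol=5)  # 54 now counts; 60 still misses
```

This shows how loosening the tolerance from ±3 to ±5 can raise the measured rate without any change to the underlying detector, which is exactly the deployment caveat the abstract highlights.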