MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy

arXiv cs.RO / 4/23/2026


Key Points

  • MATT-Diff is a diffusion-policy-based control method for active multi-target tracking with a mobile agent that can handle exploration, tracking, and target reacquisition without knowing the number, states, or dynamics of targets in advance.
  • The approach balances uncertainty reduction for detected-but-uncertain targets with exploration for undetected or lost targets, enabling the agent to switch behaviors appropriately.
  • The paper builds a demonstration dataset using three expert planners (frontier-based exploration, an uncertainty-based exploration/tracking switcher, and a time-based exploration/reacquisition switcher) to provide multimodal behavior targets.
  • MATT-Diff uses a vision transformer for egocentric map tokenization and an attention mechanism to fuse variable target estimates modeled as Gaussian densities, learning multimodal action sequences via a diffusion denoising process.
  • Experiments show improved tracking performance over learning-based baselines in new environments, and the multimodal behaviors reflect the diversity of the expert planners; the code is released on GitHub.
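To make the fusion step in the key points concrete, here is a minimal sketch of cross-attention in which egocentric-map tokens (as produced by a vision transformer) attend over a variable number of target estimates. The function name, dimensions, and the assumption that each Gaussian estimate has already been projected to the token dimension are all illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_targets(map_tokens, target_feats):
    """Cross-attention: map tokens (queries) attend over a variable
    number of target estimates (keys/values).

    map_tokens:   (N_tokens, d)  -- ViT tokens of the egocentric map
    target_feats: (N_targets, d) -- each row encodes one Gaussian
                  estimate (mean + covariance), projected to d (assumed)
    Returns fused tokens of shape (N_tokens, d).
    """
    d = map_tokens.shape[-1]
    scores = map_tokens @ target_feats.T / np.sqrt(d)  # (N_tokens, N_targets)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return map_tokens + weights @ target_feats         # residual connection

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))   # 16 map tokens, d = 8
targets = rng.standard_normal((3, 8))   # 3 detected-target estimates
fused = fuse_targets(tokens, targets)
print(fused.shape)  # (16, 8)
```

Because attention pools over the key/value axis, the same policy network handles any number of detected targets, which is what lets the method avoid assuming a known target count.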

Abstract

This paper proposes MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy, a control policy for active multi-target tracking using a mobile agent. The policy enables multiple behavior modes for the agent, including exploration, tracking, and target reacquisition, without prior knowledge of the target numbers, states, or dynamics. Effective target tracking demands balancing exploration for undetected or lost targets with exploitation, i.e., uncertainty reduction, of detected but uncertain ones. We generate a demonstration dataset from three expert planners including frontier-based exploration, an uncertainty-based hybrid planner switching between frontier-based exploration and RRT* tracking, and a time-based hybrid planner switching between exploration and target reacquisition based on target detection time. Our control policy utilizes a vision transformer for egocentric map tokenization and an attention mechanism to integrate variable target estimates represented by Gaussian densities. Trained as a diffusion model, the policy learns to generate multimodal action sequences through a denoising process. Evaluations demonstrate MATT-Diff's superior tracking performance against other learning-based baselines in novel environments, as well as its multimodal behavior sourced from the multiple expert planners. Our implementation is available at https://github.com/CINAPSLab/MATT-Diff.
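The abstract's "denoising process" refers to diffusion-style sampling of action sequences. The sketch below shows a standard DDPM reverse loop over a short action horizon; the noise schedule, horizon, and the placeholder noise predictor are assumptions for illustration (in MATT-Diff the predictor would be the trained policy conditioned on map tokens and target estimates).

```python
import numpy as np

def sample_action_sequence(denoise_fn, horizon=8, action_dim=2,
                           n_steps=50, rng=None):
    """Minimal DDPM-style reverse process producing an action sequence.

    denoise_fn(x, t) predicts the noise present in x at step t.
    The linear beta schedule and shapes are illustrative assumptions.
    """
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoise_fn(x, t)
        # DDPM posterior mean: remove the predicted noise, rescale
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # add sampling noise except at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

# Placeholder predictor; a trained, conditioned network would go here.
actions = sample_action_sequence(lambda x, t: np.zeros_like(x))
print(actions.shape)  # (8, 2)
```

Sampling from noise rather than regressing a single action is what allows the policy to represent multimodal behavior, e.g. committing to either an exploration trajectory or a tracking trajectory instead of averaging the two.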