CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

arXiv cs.AI / March 25, 2026

Key Points

  • CoMaTrack is introduced as a competitive, game-theoretic multi-agent reinforcement learning framework for Embodied Visual Tracking (EVT), designed to improve adaptive planning and robustness to interference in dynamic adversarial settings.
  • The work also presents CoMaTrack-Bench, described as the first benchmark for competitive EVT with tracker-versus-opponent game scenarios spanning diverse environments and language instructions to standardize robustness evaluation under active adversarial interaction.
  • Experiments report state-of-the-art performance on both existing EVT benchmarks and the new competitive benchmark, indicating stronger generalization than prior single-agent imitation learning approaches.
  • A headline result: a 3B vision-language-action model trained with CoMaTrack surpasses earlier single-agent imitation learning methods built on 7B models on EVT-Bench, with reported scores of 92.1% (STT), 74.2% (DT), and 57.5% (AT).
  • The benchmark code is planned for release via the provided GitHub repository link, enabling other researchers to reproduce and evaluate against CoMaTrack-Bench.

Abstract

Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, which suffers from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench.
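To give a flavor of the competitive training idea, the toy sketch below implements alternating self-play between a "tracker" and an "opponent" in a minimal zero-sum pursuit game with tabular Q-learning. Everything here — the ring world, the reward, the hyperparameters — is invented for illustration and is not the paper's method or environment; CoMaTrack itself trains vision-language-action policies in rich simulators.

```python
# Illustrative sketch only: a toy zero-sum pursuit game trained with
# alternating self-play Q-learning, loosely analogous in spirit to
# competitive tracker-vs-opponent training. All names and dynamics are
# assumptions made for this example, not details from the paper.
import random

RING = 12            # positions 0..11 on a ring; state = relative distance
ACTIONS = (-1, 0, 1)

def step(dist, a_tracker, a_opponent):
    """Relative-distance dynamics; the tracker wants dist near 0."""
    new = (dist + a_opponent - a_tracker) % RING
    reward = 1.0 if min(new, RING - new) <= 1 else 0.0  # tracker's reward
    return new, reward

def greedy(q, s):
    """Pick the action with the highest Q-value in state s."""
    return max(ACTIONS, key=lambda a: q[(s, a)])

def train(episodes=3000, eps=0.2, alpha=0.3, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q_t = {(s, a): 0.0 for s in range(RING) for a in ACTIONS}  # tracker
    q_o = {(s, a): 0.0 for s in range(RING) for a in ACTIONS}  # opponent
    for ep in range(episodes):
        train_tracker = ep % 2 == 0  # alternate which side is learning
        s = rng.randrange(RING)
        for _ in range(20):
            a_t = (rng.choice(ACTIONS)
                   if train_tracker and rng.random() < eps
                   else greedy(q_t, s))
            a_o = (rng.choice(ACTIONS)
                   if (not train_tracker) and rng.random() < eps
                   else greedy(q_o, s))
            s2, r = step(s, a_t, a_o)
            if train_tracker:
                target = r + gamma * max(q_t[(s2, a)] for a in ACTIONS)
                q_t[(s, a_t)] += alpha * (target - q_t[(s, a_t)])
            else:  # opponent maximizes the negated (zero-sum) reward
                target = -r + gamma * max(q_o[(s2, a)] for a in ACTIONS)
                q_o[(s, a_o)] += alpha * (target - q_o[(s, a_o)])
            s = s2
    return q_t, q_o

def evaluate(q_t, q_o, episodes=50, seed=1):
    """Average per-step tracker reward under both greedy policies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s = rng.randrange(RING)
        for _ in range(20):
            s, r = step(s, greedy(q_t, s), greedy(q_o, s))
            total += r
    return total / (episodes * 20)
```

The key structural point the paper's framework exploits — and which even this toy shows — is that the opponent is itself adaptive, so the tracker's training distribution shifts as its rival improves, unlike imitation learning against a fixed expert dataset.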