CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

arXiv cs.AI / March 25, 2026

Key Points

  • CoMaTrack is introduced as a competitive, game-theoretic multi-agent reinforcement learning framework for Embodied Visual Tracking (EVT), designed to improve adaptive planning and robustness to interference in dynamic adversarial settings.
  • The work also presents CoMaTrack-Bench, described as the first benchmark for competitive EVT with tracker-versus-opponent game scenarios spanning diverse environments and language instructions to standardize robustness evaluation under active adversarial interaction.
  • Experiments report state-of-the-art performance on both existing EVT benchmarks and the new competitive benchmark, indicating stronger generalization than prior single-agent imitation learning approaches.
  • A headline result: a 3B vision-language-action model trained with CoMaTrack surpasses earlier single-agent imitation learning methods built on 7B models on EVT-Bench, with reported scores of 92.1% (STT), 74.2% (DT), and 57.5% (AT).
  • The benchmark code is planned for release via the provided GitHub repository link, enabling other researchers to reproduce and evaluate against CoMaTrack-Bench.

Abstract

Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, which suffers from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench.
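To give a flavor of the competitive training idea, the toy sketch below implements alternating self-play between a "tracker" and an "opponent" in a minimal zero-sum pursuit game with tabular Q-learning. Everything here — the ring world, the reward, the hyperparameters — is invented for illustration and is not the paper's method or environment; CoMaTrack itself trains vision-language-action policies in rich simulators.

```python
# Illustrative sketch only: a toy zero-sum pursuit game trained with
# alternating self-play Q-learning, loosely analogous in spirit to
# competitive tracker-vs-opponent training. All names and dynamics are
# assumptions made for this example, not details from the paper.
import random

RING = 12            # positions 0..11 on a ring; state = relative distance
ACTIONS = (-1, 0, 1)

def step(dist, a_tracker, a_opponent):
    """Relative-distance dynamics; the tracker wants dist near 0."""
    new = (dist + a_opponent - a_tracker) % RING
    reward = 1.0 if min(new, RING - new) <= 1 else 0.0  # tracker's reward
    return new, reward

def greedy(q, s):
    """Pick the action with the highest Q-value in state s."""
    return max(ACTIONS, key=lambda a: q[(s, a)])

def train(episodes=3000, eps=0.2, alpha=0.3, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q_t = {(s, a): 0.0 for s in range(RING) for a in ACTIONS}  # tracker
    q_o = {(s, a): 0.0 for s in range(RING) for a in ACTIONS}  # opponent
    for ep in range(episodes):
        train_tracker = ep % 2 == 0  # alternate which side is learning
        s = rng.randrange(RING)
        for _ in range(20):
            a_t = (rng.choice(ACTIONS)
                   if train_tracker and rng.random() < eps
                   else greedy(q_t, s))
            a_o = (rng.choice(ACTIONS)
                   if (not train_tracker) and rng.random() < eps
                   else greedy(q_o, s))
            s2, r = step(s, a_t, a_o)
            if train_tracker:
                target = r + gamma * max(q_t[(s2, a)] for a in ACTIONS)
                q_t[(s, a_t)] += alpha * (target - q_t[(s, a_t)])
            else:  # opponent maximizes the negated (zero-sum) reward
                target = -r + gamma * max(q_o[(s2, a)] for a in ACTIONS)
                q_o[(s, a_o)] += alpha * (target - q_o[(s, a_o)])
            s = s2
    return q_t, q_o

def evaluate(q_t, q_o, episodes=50, seed=1):
    """Average per-step tracker reward under both greedy policies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s = rng.randrange(RING)
        for _ in range(20):
            s, r = step(s, greedy(q_t, s), greedy(q_o, s))
            total += r
    return total / (episodes * 20)
```

The key structural point the paper's framework exploits — and which even this toy shows — is that the opponent is itself adaptive, so the tracker's training distribution shifts as its rival improves, unlike imitation learning against a fixed expert dataset.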