Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation

arXiv cs.CV / 4/3/2026


Key Points

  • The paper argues that most visual tracking systems operate in a “fire-and-forget” manner and proposes Interactive Tracking, where users can steer a tracker at any time with natural-language commands, enabling real human-in-the-loop use cases.
  • It introduces InteractTrack, a new large-scale benchmark with 150 videos, densely annotated bounding boxes, and timestamped language instructions to support research on interactive tracking.
  • The authors provide a dedicated evaluation protocol and show that 25 representative state-of-the-art trackers perform poorly in interactive scenarios, indicating that gains on conventional benchmarks do not reliably transfer.
  • They propose IMAT (Interactive Memory-Augmented Tracking), a baseline that uses dynamic memory to learn from user feedback and update tracking behavior over time.
  • The benchmark, evaluation assets, and results are published to serve as a foundation for building more adaptive, collaborative tracking systems.
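The core loop described above — timestamped language instructions feeding a dynamic memory that conditions subsequent predictions — can be sketched in miniature. This is a hypothetical illustration only: the class, method names, and memory policy below are assumptions for exposition, not the paper's actual IMAT implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Feedback:
    """One timestamped user instruction, as in InteractTrack's annotations."""
    frame_idx: int
    instruction: str  # e.g. "switch to the person in the blue jacket"


class InteractiveTracker:
    """Hypothetical sketch of a memory-augmented interactive tracker:
    user feedback accumulates in a bounded dynamic memory, and every
    prediction is conditioned on the feedback issued so far."""

    def __init__(self, memory_size: int = 8):
        # Bounded memory: oldest instructions are evicted first.
        self.memory: deque[Feedback] = deque(maxlen=memory_size)

    def interact(self, frame_idx: int, instruction: str) -> None:
        # A user may intervene at any frame; the command is stored with its timestamp.
        self.memory.append(Feedback(frame_idx, instruction))

    def track(self, frame_idx: int) -> dict:
        # Only instructions issued at or before the current frame may influence it.
        active = [f for f in self.memory if f.frame_idx <= frame_idx]
        # A real tracker would fuse `active` with visual features to produce a box;
        # here we just expose which instructions condition the current prediction.
        return {"frame": frame_idx,
                "conditioned_on": [f.instruction for f in active]}


tracker = InteractiveTracker()
tracker.interact(10, "switch to the person in the blue jacket")
result = tracker.track(25)
```

The bounded `deque` stands in for the paper's dynamic memory mechanism; the key property illustrated is that tracking behavior changes over time as feedback arrives, rather than being fixed at initialization.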

Abstract

Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.